Residue  Number  Systems  Methods  and  Apparatuses

ABSTRACT

A method for performing reconstruction using a residue number system includes selecting a set of moduli. A reconstruction coefficient is estimated based on the selected set of moduli. A reconstruction operation is performed using the reconstruction coefficient.

CROSS REFERENCE TO RELATED APPLICATIONS

This patent claims the benefit of priority from U.S. Provisional PatentApplication Ser. No. 61/311815, entitled “Ultrafast Residue NumberSystem Using Intermediate Fractional Approximations That Are RoundedDirectionally, Scaled and Computed,” filed on Mar. 9, 2010, incorporatedherein by reference in its entirety.

FIELD OF THE INVENTION

The invention relates to performing residue number system calculations,and in particular, to reduced complexity algorithms and hardware designsfor performing residue number system calculations. This invention is inthe field of the Residue Number Systems (RNS) and their applications.

§1 BACKGROUND OF THE INVENTION

The RNS uses a set “

” of pair-wise co-prime positive integers called as the “moduli”

={m₁, m₂, . . . , m_(r), . . . , m_(k)} where m_(r)>1 ∀r ∈ [1, K] andgcd (m_(i), m_(j))=1 for i≠j (B-1)

Note that |

|=the number of moduli=K (also referred to as the number of “channels”in the literature)

We use the term “component-modulus” to refer to any single individualmodulus (ex: m_(r)) in the set

.

For the sake of convenience, we also impose an additional orderingconstraint: m_(i)<m_(j) if i<j

Total-modulus

m₁×m₂× . . . ×m_(K). Typically

>>K (B-2)

Any integer Z ∈ [0,

, can be uniquely represented by the ordered-touple (or vector) ofresidues:

∀ Z ∈ [0,

−1]; Z≡

=[z₁, z₂, . . . , z_(K)] where, z_(r)=(Z mod m_(r)), r=1, . . . , K(B-3)

Conversion from residues back to an integer is done using the “ChineseRemainder Theorem” as follows:

Z=(Z_(T) mod

) (B-4) where

$Z_{T} = \left( {\sum\limits_{r = 1}^{K}{\mathcal{M}_{r} \cdot \rho_{r}}} \right)$

(B-5)

p_(r)=((z_(r)·w_(r)) mod m_(r)), r=1, . . . , K where (B-6)

outer-weights

ℳ i = m i ;

inner-weights

$w_{i} = \left( {\frac{1}{\mathcal{M}_{i}}{mod}\mspace{11mu} m_{i\;}} \right)$

are constants for a given

(B-7)

The Residue Number Systems (abbreviated “RNS”) have been around forwhile [4]. The underlying Residue Domain representation (or simplyResidue Representation, abbreviated “RR”) has some unique attributes(explained below) that make it attractive for signal processing. It istherefore not surprising that the early work in this area wascontributed by the signal-processing community.

Thereafter, from the late 1970s through the mid 1980s, the field ofcryptology was revolutionized by the invention of the 3 most fundamentaland widely used cryptology algorithms, viz., Diffie-Hellman, RSA, andElliptic-curves. In the beginning, the aforementioned cryptologyalgorithms were not easy to implement in hardware. However, assemiconductor device sizes kept on shrinking, the hardware that could beintegrated on a single chip kept on becoming larger as well as faster.As a result, today (in 2011) it possible to easily realize thecryptographic (abbreviated “crypto”) algorithms in hardware. Theword-lengths used in crypto methods are substantially larger as comparedwith wordlengths required by other applications; typical crypto wordlengths today are at least 256 bits or higher. It turns out that thesame attributes of the Residue Number Systems that are attractive forsignal processing are also beneficial when implementing cryptographicalgorithms at long word-lengths. Consequently the cryptology andcomputer-arithmetic communities also started researching RNS. Thiscoincidental convergence of goals (to research and improve RNS, which isnow shared by the signal processing, cryptology as well ascomputational/computer arithmetic communities) has in-turn led to aresurgence of interest as well as activity in the RNS [5].

§1.1 Advantages of the Residue Domain Representation

The main advantage of the Residue Number (RN) system is that in theResidue-Domain (RD), the operations {±, ×,

} can be implemented on a per-component/channel basis, wherein theprocessing required in any single channel is completely independent ofthe processing required in any other channel. In other words, theseoperations can be implemented fully in parallel on a per-channel basisas follows:

Z = X

Y

[z_(r)=(x _(r)

y _(r)) mod m _(r))], r=1, . . . , K where

∈ {±, ×, ?}  (1)

Note that equality of two numbers can be checked by comparing theirresidues which can be done in parallel in all channels. In other words,in the RD, the most fundamental operations viz., addition/subtraction,equality check AND Multiplication can all be performed in parallel ineach channel independently of any other channel(s). This independence ofchannels implies that each of the above operations can be implementedwith O(n) computing effort (operations/steps), where

┌lg

┐=n   (2)

i.e., n is the number of bits required to represent the total-modulus

or the overall range of the RNS.

In contrast, in the regular integer domain, a multiplication is aconvolution of the digit-strings representing the two numbers beingmultiplied. A convolution is substantially more expensive thanadd/subtract operations (Addition/Subtraction fundamentally require O(n)operations and can be implemented in O(lg n) delay using the“carry-look-ahead” method and its variants. Naive paper-and-pencilmultiplication, requires O(n²) operations. Asymptotically fastestmultiply methods use transforms such as floating point FFT (Fast FourierTransform) or number-theoretic transforms to convert the convolution inthe original domain into a point-wise product in the transform domain,so that the number of operations required turns out to be ≈O(nlg n); forfurther details, please refer to [2]).

Thus, performing the multiplications in the RD is substantially fasteras well as cheaper (in terms of interconnect length and therefore h/warea as well as power consumption). Consequently, wherevermultiplication is heavily used, adopting the RR can lead to smaller andfaster realizations that also consume less power. For example:

(i) Filtering is heavily used in signal processing. Most of the effortin filtering is in the repeated multiply and add operations. It istherefore not surprising that the first practical use of the RNS was insynthesizing fast filters for signal processing.

(2) Multiplication (note that squaring is a special case ofmultiplication) also gets used heavily in long-wordlength cryptologyalgorithms. Therefore RD implementations of cryptological algorithms arealso smaller, faster and consume lower power.

§1.2 Disadvantages of the Residue Domain Representation

Together with the advantages, also come some of the disadvantages of theresidue domain: when compared to the “easy operations” above, severalfundamental operations are relatively a lot more difficult to realize inthe RD [4, 6-8]:

-   -   1. Reconstruction or conversion back to a weighted, non        redundant positional representation (ex, binary or decimal or        the “mixed-radix” representation [6])    -   2. Base extension or change.    -   3. Sign and overflow detection or equivalently, a        magnitude-comparison.    -   4. Scaling or division by a constant, wherein, the divisor is        known ahead of time (such as the Modulus in the RSA or        Diffie-Hellman algorithms)    -   5. Division by an arbitrary divisor whose value is dynamic,        i.e., available only at run-time.

Reconstruction in the regular format by directly using the CRT turns outto be a slow operation. Note that a straightforward/brute-forceapplication of the CRT entails directly implementing Equation (B-4).Accordingly, Z_(T) is fully evaluated first, and then a division by themodulus M is carried out to retrieve the remainder (Z). For longword-lengths (ex, in cryptography applications) the final division by Mis unacceptably slow and inefficient.

Re-construction by evaluating the mixed-radix representation takesadvantage of the “mixed-radix” representation associated with everyresidue-domain-representation [6], wherein a number is represented by anordered set of digits. The value of the number is a weighted sum wherethe weights are positional (just like the weights of a normal singleradix decimal or binary representation). As a result, a digit-by-digitcomparison starting with the most-significant-digit is feasible.However, to the best of our knowledge it takes O(K²) sequentialoperations (albeit on small sized operands of about the same size as thecomponent-moduli m_(i)) in the residue-domain. The inherently sequentialnature of this method makes it slow.

Moreover, at a first glance, it appears that for magnitude-comparisonand division, the operands need to be fully reconstructed in the form ofa unique digit string representing an integer either in the regular orthe mixed-radix-format.

§1.3 Related Prior Art

§1.3.1 Base-Extension or Change

In a sign-magnitude representation or a radix-complement (such as thetwo's complement) representation, a 32 bit integer can be easilyextended into a 64 bit value. The corresponding operation in the RNS isconsiderably more involved. Related Prior work in this area falls undertwo categories, each is briefly explained next.

§1.3.1.A Deploying a Redundant Modulus

Shenoy and Kumarersan [9] start by re-expressing the CRT in a slightlydifferent form:

Z=Z _(T)−α·

where 0

α

K−1

In the above equation α=

_(C) is the only unknown. It is clear that knowledge of (Z mod m_(e)),i.e., the residue/remainder of Z, w.r.t. one extra/redundant modulusm_(e) is sufficient to determine the value of

_(C). They assume the availability of such an extra residue, which letsthem evaluate α=

_(C).

This base extension method has been widely adopted in the literature.For example, Algorithms for modular multiplication developed by Bajardet. al. [10, 11] perform their computations in two independent RNSsystems and change base from one to the other using the shenoy-kumaresanmethod. This is done so as to avoid a full reconstruction atintermediate steps. As a result, they end up requiring a base-conversionin each step and consequently, their algorithm requires O(K) units ofdelay when O(K) dedicated processing elements are available (where K=thetotal number of moduli or channels in the RNS system).

§1.3.1.B Iterative Determination of

_(C)

Another base-extension algorithm related to our work is described in[12-15]. They show a method to evaluate an approximate estimate

in a recursive, bit-by-bit (i.e., one bit-at-a-time) manner and thenderive conditions under which the approximation is error-free. Thismethod is at the heart of their base-extension algorithm.

The recursive structure of this method makes it relatively slower andcumbersome.

The idea of using the “fractional-representation” of CRT has been aroundfor a while. For instance, Vu [16, 17] proposed using a Fractionalinterpretation of the CRT in the mid 1980s. However, he ends up using avery high (actually the FULL) precision: [lg (K·

)] bits (see equations (13) and (14) in reference [17]).

§1.3.2 Sign Detection and Magnitude Comparison

In the RNS, the total range is divided into positive and negativeintervals of the same length (to the extent possible). For example, ifthe set of RNS moduli is

={2, 3, 5, 7} then

=210 and the overall range of the RNS is [−105,+104], where,

the numbers 1 through 104 represent +ve numbers, and

the numbers 105 thru 209 represent −ve numbers from −105 to −1,respectively.

In general, all negative values in the range [−(

−1), −1] satisfy the relation

(−α) mod

≡

−α  (3)

Sign detection in the RNS is not straightforward, rather, it has beenknown to be relatively difficult to realize in the Residue Domain.

Likewise, comparison of magnitudes of two numbers represented as residuetouples is also not straightforward (independent of whether or notnegative numbers are included in the representation). For instance, withthe same simple moduli set

={2,3,5,7} above, note that 18≡(1, 1, 4, 5) and 99≡(1,0,4,1) while79≡(1, 1, 4, 1) and the negative number −101≡(1,1,4,4).

In other words, the touples of remainders corresponding to +ve and −venumbers cannot be easily distinguished.

Prior Work on Sign Detection and Magnitude Comparison

Sign detection operation has been known to be relatively difficult torealize in the RNS for a while (early works date back to 1960's, forexample [4,18]). Recent works related to Sign detection in RNS havetended to focus on using moduli having special forms [19, 20], whichlimits their applicability. The idea of “core-functions” was introducedin [21] in the context of coding-theory. RNS sign-detection algorithmsbased on idea of using “core-functions” have been published [22-24].However, these methods are unnecessarily complicated and appear to beuseful only with moduli with special properties [24], limiting theirapplicability.

Lu and Chiang [25, 26] introduced a method to use the least significantbit (lsb) to keep track of the sign. However, tracking the lsb ofarbitrary (potentially all possible) numbers is not an easy task. Intheir quest to keep track of the lsb, Lu and Chiang first proposed anexhaustive method in their first publication [25]; which turns out to beinfeasible for all but small toy examples because the size of theirlook-up table was the same as the total range M. In the follow uppublication, they abandoned the exhaustive look-up approach [26] andended up unnecessarily using the full precision, just as Vu does in hiswork [17].

§1.3.3 Scaling or Division by a Constant

In general, “scaling” includes both multiplication as well as divisionby a fixed constant, (viz., the scaling factor S_(f)). Early versions ofsignal processors often deployed a fixed-point format which necessitatedscaling to cover a wider dynamic range of input values. Consequently,scaling has been heavily used in signal processing. It is therefore notsurprising that the early work in realizing the scaling operation in theresidue-domain comes from the signal-processing community [27, 28].

Shenoy and Kumaresan [29, 30] introduced a scaling method that worksonly if the constant divisor has the special form

D=m _(d) ₁ ·m _(d) ₂ . . . m _(d) _(s) wherein s<K and

{m _(d) ₁ , m _(d) _(s) . . . m _(d) _(s) }⊂{m ₁ , m ₂ , . . . , m_(k)}=

=the set of RNS moduli   (4)

i.e., the divisor is a factor of the overall modulus M. This restrictionrenders their method inapplicable in most cryptographic algorithms;because the modulus (aka, the constant divisor) N is either a largeprime number (as in elliptic curve methods) or a product of two largeprimes numbers (as in RSA). In either case, it does not share a factorwith the total modulus

.

All methods and apparata for scaling in the RNS that have been publishedthus far [23, 31-33]; including more recent ones [33-35] are eitherlimited to special moduli or are more involved than necessary becausethey all attempt to estimate the remainder first, subtract it off andthen arrive at the quotient, which is the quantity of interest inscaling. Consequently, none of the methods or apparata are even remotelysimilar to the new algorithm that I have invented for RNS division by aconstant.

§2 SUMMARY OF THE INVENTION

The following presents a simplified summary in order to provide a basicunderstanding of some aspects of the claimed subject matter. Thissummary is not an extensive overview, and is not intended to identifykey or critical elements, or to delineate any scope of the disclosure orclaimed subject matter. The sole purpose of the subject summary is topresent some concepts in a simplified form as a prelude to the moredetailed description that is presented later. In one exemplary aspect, amethod for performing reconstruction using a residue number system isdisclosed. A set of moduli is selected. A reconstruction coefficient isestimated based on the selected set of moduli. A reconstructionoperation is performed using the reconstruction coefficient. In anotherexemplary aspect, an apparatus for performing reconstruction using aresidue number system includes means for selecting a set of moduli,means for estimating a reconstruction coefficient based on the selectedset of moduli and means for performing a reconstruction operation usingthe reconstruction coefficient. In yet another exemplary aspect, acomputer program product comprising a non-volatile, computer-readablemedium, storing computer-executable instructions for performingreconstruction using a residue number system, the instructionscomprising code for selecting a set of moduli, estimating areconstruction coefficient based on the selected set of moduli andperforming a reconstruction operation using the reconstructioncoefficient is disclosed. In yet another exemplary aspect, a method forperforming division using a residue number system comprises selecting aset of moduli, determining a reconstruction coefficient and determininga quotient using an exhaustive pre-computation and a look-up strategythat covers all possible inputs. In yet another exemplary aspect, amethod of computing a modular exponentiation in a residue number systemincludes iterating, without converting to a regular integerrepresentation, by performing modular multiplications and modularsquaring and computing the modular exponentiation as a result of theiterations.

§3 BRIEF DESCRIPTION OF DRAWINGS

FIG. 1: shows summation of fraction estimates (obtained vialook-up-tables) to estimate the Reconstruction Coefficient.

FIG. 2: is a flow chart for the Reduced Precision Partial Reconstruction(“RPPR”) algorithm.

FIG. 3: is a schematic block diagram of a generic architecture toimplement the RPPR algorithm.

FIG. 4: illustrates conventional method of incorporating negativeintegers in the RNS.

FIG. 5: illustrates sign and Overflow Detection by Interval Separation(SODIS).

FIG. 6: Flow chart for the Quotient First Scaling (QFS) algorithm.

FIG. 7: is a schematic timing diagram for the QFS algorithm.

FIG. 8: is a flow chart for the modular exponentiation algorithm

FIG. 9: is a flow chart representation of a process of performingreconstruction using a residue number system.

FIG. 10: is a block diagram representation of a portion of an apparatusfor performing reconstruction using a residue number system.

FIG. 11: is a flow chart representation of a process of performingdivision using a residue number system.

FIG. 12: is a block diagram representation of a portion of an apparatusfor performing division using a residue number system.

FIG. 13: is a flow chart representation of a process of computing amodular exponentiation using a residue number system.

FIG. 14: is a block diagram representation of a portion of an apparatusfor computing a modular exponentiation using a residue number system.

§4 DETAILED DESCRIPTIONS

In this section, first, I explain my moduli-selection method. Afterthat, each new algorithm illustrated in detail. Since the “RPPR”algorithm is used in all others, it has been explained in more detailthan other algorithms.

Notations Used in this Document

Notations-1 Math Functions, Symbols

The symbol ≡ means “equivalent-to”; whereas the symbol

means “is defined as”

LHS

Left Hand Side of a relation; RHS

the right-hand-side

a mod b

the remainder when integer a is divided by integer b.

ulp≡wight or value a unit or a “1” in the least-significant-place

gcd

greatest common divisor (also known as highest common factor or hcf)

lg≡log-to-base 2, ln≡log-to-base-c, log≡log-to-base-10

floor function: └x┘=the largest integer

x≡Round to the nearest integer toward−∞

ceiling function: ┌x┐=the smallest integer

x≡Round to the nearest integer toward+∞

truncation: trunc(x)=only the integer-part of x≡Round toward 0

O( )≡Order-of or the big-O function as defined in the algorithmsliterature (for example see [2]). |•|≡cardinality if argument is a set;≡absolute value of integer argument.

“RR” is abbreviation for “Residue Representation”, “RD” is abbreviationfor “Residue Domain”, “integer-domain” refers to the set of all integers

.

denotes the “equality-check” operation.

Notations-2 Algorithm Pseudo-Code

The pseudo-code syntax closely resembles MAPLE [3] syntax.

Lines beginning with # as well as everything between /* and */ arecomments.

All entities/variables with a bar on top are vectors/ordered-touples(ex, Z≡[z₁, . . . , z_(K)])

Operations that can be implemented in parallel in all channels are showninside a square/rectangular box. for example Z= X

Y

[z_(r)=(x_(r) ‡ y_(r)) mod)], r=1, . . . , K

Brief Introduction to RNS and Canonical Definitions

Definition 1: We define “Reconstruction-Remainders” to be thecomponent-wise values p₁, p₂, . . . , p_(K) defined by relations (B-6)above.

Note that Equation (B-4) can be re-written as Z_(T)=Z+Q·M (B-8.1) orequivalently as Z=Z_(T)−Q·M where, (B-8.2)

Q = ⌊ Z T ⌋ = Quotient

when Z_(T) is divided by

_(C)

0

_(C)

K−1 (B-9)

Definition 2: We define the coefficient of

(which is denoted by the variable Q) in Equations (B-8*) to be the“Reconstruction-Coefficient” and henceforth denote it by the dedicatedsymbol “

_(C)”

Definition 3: Full reconstruction of the integer corresponding to aresidue-touple refers to the process of retrieving the entire uniquedigit-string representing that integer in a non-redundant,weighted-positional format (such as two's complement or decimal or themixed-radix format).

D-4: Full-precision≡d_(T) base-b digits; where d_(T)=┌log_(b)(

K)┐≈┌log_(b)

┐

n_(b); since

>>K (B-10)

If the base b=2 then n_(b)=n₂ is the bit-length required to representthe total modulus

or the overall range of RNS; and is therefore also denoted simply by thevariable n without any subscript.

Definition 5: Any method/algorithm that simply determines the value of

_(C) without attempting to fully reconstruct Z is referred-to as a“Partial-Reconstruction” (PR).

Evaluating

_(C) yields an exact equality (Eqn. (B-8.2)) for the target integer Z,without any “mod” i.e., remaindering operations in it (unlike thestatement of CRT, Eqn. (B-4)). For most operations (especially divisionwith a constant) such an exact equality for Z suffices, i.e., there isno need to fully reconstruct Z. This is why Partial-Reconstruction(i.e., evaluating

_(C)) is an important enabling step underlying most other operations

§4.1 Moduli Selection

Note that a modulus of value m_(r) needs a table with (m_(r)−1) entriesto cover all possible values of the reconstruction-remainder p_(r)w.r.t. m_(r), (excluding the value 0). Therefore, the total number ofmemory locations required by all moduli is

$\begin{matrix}{{\# \mspace{11mu} {memory}\mspace{14mu} {locations}\mspace{14mu} {required}} = {{\sum\limits_{r = 1}^{K}\left( {m_{r} - 1} \right)} \approx {\sum\limits_{r = 1}^{K}{m_{r}_{}}}}} & (5)\end{matrix}$

Thus, in order to minimize the memory needed, each component modulusshould be as small as it can be.

Therefore, in order to cover a range [0, R] we select smallestconsecutive K prime numbers starting with either 2 or 3, such that theirproduct exceeds R:

={m1, m2, . . . , m_(K)}={2, 3, . . . , K-th prime number}, where

$\begin{matrix}{{\prod\limits_{t = 1}^{K}\; m_{t}} = {> R}} & (6)\end{matrix}$

This selection leads to the following two analytically tractableapproximations:

The notation defines m_(K) to be the K-th prime number. In other words,K is the index of prime number whose value is m_(K). Consequently, K andm_(K) can be related to each other via the well-known “prime-counting”function [36] defined as

$\begin{matrix}{{\pi (x)} = {\left( {{{The}\mspace{14mu} {number}\mspace{14mu} {of}\mspace{14mu} {prime}\mspace{14mu} {numbers}}x} \right) \approx {\frac{x}{\ln \; x}\mspace{14mu} {and}\mspace{14mu} {therefore}}}} & (7) \\{\mspace{79mu} {K = {{\pi \left( m_{k} \right)} \approx \left( \frac{m_{k}}{\ln \; m_{k}} \right)}}} & (8)\end{matrix}$

2

The overall modulus

becomes the well known “primorial” function [37] which for any positiveinteger N is denoted as “N#” and defined as

$\begin{matrix}\begin{matrix}{{N\#} = {{1\mspace{14mu} {if}\mspace{14mu} N} = 1}} \\{{= \left( {{{product}\mspace{14mu} {of}\mspace{14mu} {all}\mspace{14mu} {prime}\mspace{14mu} {numbers}}N} \right)},{otherwise}}\end{matrix} & (9)\end{matrix}$

(Note that a the definition as well as the notation for the primorial isanalogous to the well known “factorial” function (N!)). The primorialfunction satisfies well-known identities [37, 38]

2^(N)<(N#)<4^(N)=2^(2N) and   (10)

(N#)≈O(e ^(N)) for large N   (11)

As a result, to be able to represent n bit numbers (i.e. the range[0,2^(n)−1]), in the residue domain using all available prime numbers(starting with 2), the total modulus satisfies

≈exp (m _(K))>2^(n) and therefore   (12)

m _(K)≈(ln

)=ln(2^(n))=n·ln2   (13)

Substituting this value of m_(K) in Eqn. (8), K, the number of modulirequired to cover all “n”-bit long numbers can be approximated as :

$\begin{matrix}{K = {{\pi \left( m_{k} \right)} \approx \left( \frac{m_{k}}{\ln \; m_{k}} \right) \approx \frac{n\; \ln \; 2}{\ln \left( {n\; \ln \; 2} \right)} \approx {O\left( \frac{n}{\ln \; n} \right)} \approx {O\left( \frac{\lg \; }{\left( {\ln \left\lbrack {\lg \; \mathcal{M}} \right\rbrack} \right)} \right)}}} & (14)\end{matrix}$

These analytic expressions are extremely important because they imply:

A.1 K<m _(K)<<

(which follows from relations (13) and (14) above.) Moreover, both  (15)

A.2 maximum-modulus m_(K) as well as the number of moduli K growlogarithmically w.r.t.

  (16)

i.e., linearly w.r.t. the wordlength n (since n=┌lg

┐).

§4.1.1 moduli selection enables exhaustive look-up strategy that coversall possible inputs

The attributes A.1 and A.2 make it possible to exhaustively deploypre-computation and lookup because they guarantee that the total amountof memory required grows as a low degree polynomial of the wordlength n.

In other words, the main novelty in my method of moduli selection andits real significance is the fact that I leverage the selection toenable an exhaustive pre-computation and look-up strategy that coversall possible input cases. This exhaustive pre-computation and look-up inturn makes my algorithms extremely simple, efficient and thereforeultrafast because I deploy the maximum amount of pre-computationpossible, and perform as much of the task ahead of time as possible; sothat there is not much left to be done dynamically at run-time (aperfect example of this is the new “Quotient First Scaling” algorithmfor RNS division by a constant divisor that is explained in detail inSection §4.5 below).

In other words, the “minimization” of the total number of look-up tableentries is the best possible scenario, but it is not necessary to obtainthe major benefits that are illustrated for the first time in thisinvention. There is a lot more flexibility in selecting the moduli aslong as they do not make it infeasible to deploy the exhaustiveprecomputation strategy.

Consider a concrete example: Suppose the claims section says “selectmoduli so as to minimize the total amount of look-up table memoryrequired.”

If the desired range is all 32-bit numbers, then the set of moduli

={2,3,5,7, 11,13, 17,19, 23,29} minimizes the total number of look-uptable entries required.

Now, one can replace any component modulus from the above set (forexample, say the modulus 29) with another prime number (such as 31, 37or even 101). The resulting moduli set does not minimize the totalnumber of look-up table entries required, but it is sufficiently closeand would not make much of a difference in the ability to deploy theexhaustive precomputation strategy. In the strict sense, however, themodified moduli set does not satisfy the “minimization” criteria andthis fact might be used to wiggle around having to acknowledge the useof intellectual property claimed by this patent.

We would therefore like to clarify that the spirit of this part of theinvention (i.e., the moduli selection method) can be better captured bythe following description:

Select the set moduli so as to simultaneously bound both

(i) m_(k)=the maximum value in the set of moduli

by a low degree polynomial in n, as well as

(ii) K=the total number of moduli in the set

=

(also known as the number of RNS channels) by another low degreepolynomial in the wordlength n.

(both polynomials could be identical which is a special case). Followingusual practices, we consider any polynomial of degree

16 to be a low-degree polynomial.

In closing we would like to point out some additional benefits of ourmoduli selection:

-   -   +1: This selection is general, in the sense that for any value        of R multiple moduli sets always exist.    -   +2: The moduli are relatively easy to find, since prime numbers        are sufficiently densely abundant irrespective of the value of        R.    -   +3: It fully leverages the parallelism inherent in the RNS    -   +4: limiting m_(K) and K to small values makes it more likely        that the entire RNS fits in a single h/w module.

§4.2 The Reduced Precision Partial Reconstruction (“RPPR”) Algorithm

This is a fundamental algorithm that underlies all other algorithms tofollow. To speed-up the Partial-Reconstruction, we combine theinformation contained in both integer as well as fractional domains. Weexpress the CRT in the form:

Z T = R C + Z ℳ = ( ∑ r = 1 K  ρ r m r )  = Δ  S   where , f r = ρr m r ⇒ ( 17 ) C = ⌊ S ⌋ = the   integer   part   of   the  sum   of   fractions , and ( 18 ) Z ℳ = ( S - ⌊ S ⌋ ) = the  fractional   part   of   the   sum   of   fractions ( 19 )

Relation (18) states that

_(C) can be approximately estimated as the integer part of a sum of atmost K proper fractions f_(r), r=1, . . . , K. (proper fractions becausethe numerator p_(r) is a remainder w.r.t. m_(r); and therefore it isstrictly less than m_(r)).

To speed up such an estimation of the

_(C), we leverage pre-computation and look-up: for each modulus m_(r),we pre-calculate the value of each of the (m_(r)−1) fractions, i.e., allpossible fractions that can occur, and store them in the look-up table(denoted by

m_(r))

$\begin{matrix}{_{m_{r}} = {\left. \left\lbrack {\frac{1}{m_{r}},\frac{2}{m_{r}},\ldots \mspace{14mu},\frac{\mathcal{M}_{r} - 2}{m_{r}},\frac{m_{r} - 1}{m_{r}}} \right\rbrack \Rightarrow{{the}\mspace{14mu} i\text{-}{th}\mspace{14mu} {entry}\mspace{14mu} {in}\mspace{14mu} {the}\mspace{14mu} {table}} \right. = {{_{m_{r}}\lbrack i\rbrack} = {f_{i,r} = \frac{i}{m_{r}}}}}} & (20)\end{matrix}$

(if p_(r)=0 then the table entry is 0 which need not be explicitlystored).

The important point is that The look-up table for each modulus m_(r) canbe accessed independent of (and therefore in parallel with) the look-uptable for any other modulus m_(s) where r≠s.

The fractional values (obtained from the tables) are then added up asillustrated in FIG. 1

§4.2.1 Derivation of the Algorithm and Novel Aspects Therein

{circle around (1)} First, note that we need to estimate the integerpart of a sum of fractions, i.e., we need to be able to accuratelyevaluate the most-significant digits/portion of the sum as illustratedin FIG. 1. The important point is that whenever a computation needs togenerate the “most-significant” bits/digits of the target, approximationmethods can be used. For instance, in a division, “Quotient” is a loteasier to approximate than the “Remainder”.

In other words, using the rational-domain interpretation allows us tofocus on values that represent the “most-significant” bits/digits of thetarget and therefore approximation methods can be invoked.

{circle around (2)} The implication is that the precision of theindividual fractional values that get added need not be very high. Allthat is required is that the fractions

$f_{i,r} = \frac{i}{m_{r}}$

be calculated to enough precision so that when they are all addedtogether, the error is small enough so as not to reach up-to and affectthe least significant digit of the integer part (to the extentpossible).

Let the radix/base of the number representation be b and let w_(f) bethe number of fractional (radix-b) digits required. Then, for eachfraction f_(i) we generate an upper and lower bound as follows:

$\begin{matrix}{{{{For}\mspace{14mu} \rho_{i}} \neq 0},{{{let}\mspace{14mu} \frac{\rho_{i}}{m_{i}}} = {f_{i}\mspace{14mu} {the}\mspace{14mu} {exact}\mspace{14mu} {value}\mspace{14mu} {of}\mspace{14mu} {the}\mspace{14mu} {reconstruction}\text{-}{fraction}\mspace{14mu} {in}\mspace{14mu} {channel}\mspace{14mu} i}}} & (21)\end{matrix}$

f _(i)=0.d ₁ d ₂ . . . d _(w) _(f) |d _(w) _(f) ₊₁ d _(w) _(f) ₊₂   (22)

Truncation of f _(i) to w _(f) digits yields an under-estimate:{circumflex over (f)} _(i) _(—) _(low)=0.d ₁ d ₂ . . . d _(w) _(f)

f _(i)   (23)

and 0

(f _(i) −{circumflex over (f)} _(i) _(—) _(low))=0.0 . . . 0d _(w) _(f)₊₁ d _(w) _(f) ₊₂ . . . <1/b ^(w) ^(f)   (24)

However, a ceiling or rounding-toward-∞ to retain w_(f) fractionaldigits adds a ulp to the least significant digit (lsd), yielding anover-estimate: (25)

 f ^ i_high =  0 · d 1  d 2   …  [ ( d w f ) + 1 ] =  f ^ i_low +ulp  f i   where   ulp = 1 b w f ( 26 )  and   0  ( f ^i_high - f i ) < 1 / b w f ( 27 ) combining   ( 23 )   and   ( 26)   we   get   f ^ i_low  f i  f ^ i_high = ( f ^ i_low + ulp )( 28 ) ( f ^ l_low + … + f ^ K_low )  ( f 1 + … + f K ) = C + Z  ( f ^l_low + … + f ^ K_low ) + n z · ulp ( 29 )  where , n z = number_of _nonzero  _residues  _in  _the  _touple ( 30 )

The understand the upper limit in relation (29) above, note that eachnon-zero p_(i) makes the corresponding over-estimate higher than theunder-estimate by a ulp, as per Eqns (21), (23) and (26).

Let {circumflex over (S)}_(low)

({circumflex over (f)}_(l) _(—) _(low)+ . . . +{circumflex over (f)}_(K)_(—) _(low)) and {circumflex over (I)}_(low)

└{circumflex over (S)}_(low)┘=integer part of {circumflex over(S)}_(low)   (31)

{circumflex over (S)}_(high)

{circumflex over (S)}_(low)+n _(z)·ulp and {circumflex over (I)}_(high)

└{circumflex over (S)}_(high)┘=integer part of {circumflex over(S)}_(high)   (32)

Taking the “floor” of each expression in the inequalities in relations(29) above; substituting the floors from Eqns (31) and (32); and usingthe identity

 ⌊ C + Z ⌋ =  C   we   obtain ( 33 )  I ^ low   C  I ^ high  so   that ( 34 ) if   I ^ low = I ^ high   then   the  estimate    C = I high = I low   must   be   exact ( 35 )

since both upper and lower bounds converge to the same value. Inpractice (numerical simulations), this case is encountered in anoverwhelmingly large fraction of numerical examples. Moreover,

 Since   n z  K ⇒ n z · ulp  K · ulp ( 36 ) by   selecting   wf   so   as   to   ensure   K · ulp = K / b w f < 1   from  ( 29 )   we   get ( 37 )  S ^ low  C + Z M  S ^ low + 1  taking   the   floor   of   each   expression   in   the  relation   above   yields ( 38 )  I ^ low   C  I ^ low + 1 (39 )

In other words, even in the uncommon/worst cases, wherein,Î_(low)≠Î_(high), relation (39) demonstrates that the estimate of

_(C) can be quickly narrowed down to a pair of consecutive integers.

{circle around (3)} It is intuitively clear that further“disambiguation” between these choices needs at least one bit of extrainformation. This information is obtained from the value (Z mod m_(e))where m_(e) is the extra modulus. For efficiency, m_(e) should be assmall as possible. Accordingly, our method leads to only two scenarios:

[a] if

is odd then m_(e)=2 is sufficient for disambiguation

[b] otherwise if 2 is included in the set of moduli

, then m_(e)=4 is sufficient for disambiguation

m_(e) ∈ {2,4} (this is analytically proved in [39])   (40)

Note that when

includes “2” as a modulus, it already contains the value (z₁=Z mod 2),i.e., the least significant bit of the binary representation of Z. Thevalue (Z mod 4) therefore conveys only one extra bit of informationbeyond what the residue touple conveys.

{circle around (4)} It is reasonable to assume that for primary/externalinputs the extra-info is available. The exhaustive pre-computations canalso assume that the extra-info is available. Starting with these, wegenerate the extra-bit of information (either explicitly or implicitly)for every intermediate value we calculate/encounter. This is done in aseparate dedicated channel. Let

W=*(X) where * is a unary operation. Then,   (41)

W mod m _(e)=[*(X mod m _(e))] mod m _(e) for * ∈ {left-shift,power}  (42)

If the operation is a right shift, then finding the remainder of theshifted value w.r.t. m_(e), is slightly more involved but it can beevaluated using a method identical to “Quotient_First_Scaling”, i.e.,“Divide_by_Constant', which explained in detail in Section §4.5 below.

Likewise, let

Z=X

Y where

is a binary operation. Then,   (43)

Z mod m _(e)=[(X mod m _(e))

(Y mod m _(e))] mod m _(e) for

∈ {±, ×}  (44)

Finally, since division is fundamentally a sequence of shift andadd/subtract operations, as long as we keep track of the remainder ofevery intermediate value w.r.t. m_(e), we can also derive the values of(Quotient mod m_(e)) and (Remainder mod m_(e)). Thus all the basicarithmetic operations are covered.

§4.2.2 Analytical Results

Result 1 Pre-conditions: Let the radix of the original (non-redundant,weighted and positional, i.e., usual) number representation be denotedby the symbol “b” (note that b=10 yields the decimal representation, b=2gives the binary representation). Suppose integer Z=[z₁, z₂, . . . ,z_(K)] is being partially re-constructed and theextra-bit-of-information, i.e., the value of (Z mod m_(e)) is alsoavailable. Let

Z T ≈ S ^ = I ^ + F ^ = ∑ r = 1 K  f ^ r   where , ( 45 ) I ^ = ⌊ S ^⌋ = the   integer   part   of   the   approximate   sum  S ^ , and ( 46 ) F ^ = S ^ - I ^ = the   fractional   part   of  the   sum , and ( 47 ) f ^ i =  f ^ i_low =  Trunc  [ w F ]  ( fi ) =  truncation   of   f i   to   w F   digits   where (48 ) f i = ρ i m i ⇒ 0  f i < 1 ( 49 )

and p_(i) values are the reconstruction-remainders defined in Equation(B-6)

Let   δ = ( Z T - S ^ )   be   the   total   error   in  the   approximate   estimate   S ^ ( 50 ) and   let   the  Reconstruction  -  Coefficient    C   be   estimated   as  C ≈ I ^   then , ( 51 )

Result 1: In order to narrow the estimate of the ReconstructionCoefficient

_(C) down to two successive integers, viz., Î or (Î+1), it is sufficientto carry out the summation of the fractions (whose values can obtainedfrom the look-up-tables) in a fixed-point format with no more than atotal of w_(T) radix-b digits, wherein

w _(T) =w _(I) =w _(F) where   (52)

w_(I)=Number of digits allocated to the Integer part, and   (53)

w_(F)=Number of digits allocated to hold the fractional part   (54)

where the precisions (i.e., the digit lengths) of the integer andfractional parts satisfy the conditions:

R1.1 w _(I)=┌log_(b) K┐  (55)

R1.2 w _(F)=┌log_(b) (K·Δ _(uuzf))┐ where, K=number of moduli=

, and   (56)

Δ_(uuzf)≡“Unicity Uncertainty Zone Factor”, that satisfies Δ_(uuzf)

2   (57)

R1.3 The Rounding mode adopted in the look-up-tables (when limiting thepre-computed values of the fractions to the target-precision) as well asduring the summation of fractions as per equation (51) must beTRUNCATION, i.e., discard excess bits.

For the proof of the above result as well as all other analyticalresults stated below, please refer to [39].

Result 2: In order to disambiguate between the two possible values ofthe Reconstruction Coefficient i.e., select the correct value (Î) or(Î+1), a small amount of extra information is sufficient.

R2.1 In particular, (prior) knowledge of the remainder of Z (the integerbeing partially reconstructed), w.r.t. one extra component modulus m_(e)that satisfies

gcd(

, m _(e))<m _(e) is sufficient for the disambiguation.   (58)

R2.2 For computational efficiency, the minimum value of m_(e) thatsatisfies (58) should be selected.

Such a selection gives rise to the following two canonical cases:

-   -   {circle around (1)}        is odd: In this case, m_(e)=2 is sufficient for disambiguation.    -   {circle around (2)}        contains the factor 2: then, m_(e)=4 is sufficient for        disambiguation.

§4.2.3 RPPR Algorithm: Illustrative Examples

The pre-computations and Look-up-tables needed for the partialreconstruction are illustrated next.

§4.2.3.A First an Example with Small Values, to Bootstrap the Concepts

Let the set of moduli be

={3,5,7,11}, so that, K=4,

=1155, and let m_(e)=2

TABLE 1 Look-up table for the RPPR algorithm for the RNS-ARDSP systemwith

 = {3, 5, 7, 11}. This table uses the value of ρ_(r) as the address tolook-up the fraction $\left( \frac{\rho r}{m_{r}} \right)$

 explicit value of ρ_(r) is needed. In this case K = 4 and Δ_(uuzf) = 2and therefore w_(I) = ┌log₁₀(4 − 1)┐ = 1 integer decimal-digit and w_(F)= ┌log₁₀(4 × 2)┐ = 1 fractional decimal-digit. Accordingly, all theentries in the table have a single fractional digit. In this toy examplewe have deliberately left the table entries in the fixed-pointfractional form for the sake of clarity (rather than scaling them by thefactor 10¹ and listing them as integers). modulus${table}\mspace{14mu} {entries}\mspace{14mu} {for}\mspace{14mu} {row}\mspace{14mu} {m_{r}:\left. {{column}\mspace{14mu} i}\;\leftarrow\frac{i}{m_{r}} \right.}$↓ m_(r) 1 2 3 4 5 6 7 8 9 10  3 → 0.3 0.6  5 → 0.2 0.4 0.6 0.8  7 → 0.10.2 0.4 0.5 0.7 0.8 11 → 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Then, the look-up table for the RPPR algorithm is shown in Table 1.

The table consists of 4 subtables (one per-channel/modulus) that areindependently accessible in parallel.

For each value of m_(r), the table consists of a row that simply storesthe approximate pre-computed values of the fractions

$\frac{1}{m_{r}},\frac{2}{m_{r}},\ldots \mspace{14mu},{\frac{m_{r} - 1}{m_{r}}.}$

§4.2.3.B Further Optimization: Skip the Computation of p_(r) Values

Note that instead of explicitly calculating p_(r) and then using it asan index into a table, the residue z_(r) could be directly used as anindex into a table that stores the appropriate values of the precomputedfractions.

$\begin{matrix}{\frac{\rho_{r}}{m_{r}} = \left( \frac{\left( {\left( {w_{i} \times z_{r}} \right){mod}\; m_{r}} \right)}{m_{r}} \right)} & (59)\end{matrix}$

The resulting table is illustrated in Table 2.

In this toy example the number fractional digits required forintermediate computations is 1 which is not a sizable reduction from thefull precision which is 3 digits.

The the following non-trivial long-wordlength example demonstrates thetruly drastic reduction in precision that is afforded by our novelalgorithm.

TABLE 2 Residue Addressed Look-up Table (RAT) for the RPPR algorithm forthe RNS-ARDSP system with

 = {3, 5, 7, 11}. This table is a further optimized version of Table 1above: here, the residue value z_(r) is directly used as the address ofa location that stores the corresponding value of$\left( \frac{\left( {\left( {w_{i} \times z_{r}} \right)\mspace{11mu} {mod}\mspace{14mu} m_{r}} \right)}{m_{r}} \right)$in the rth row (i.e., sub-table) for component- modulus m_(r).Calculation of ρ_(r) is not required when this table is used. modulus${table}\mspace{14mu} {entries}\mspace{14mu} {for}\mspace{14mu} {row}\mspace{14mu} {m_{r}:\left. {{column}\mspace{14mu} i}\;\leftarrow\frac{\left( {\left( {i \times w_{i}} \right)\mspace{11mu} {mod}\mspace{14mu} m_{r}} \right.}{m_{r}} \right.}$↓ m_(r) 1 2 3 4 5 6 7 8 9 10  3 → 0.3 0.6  5 → 0.2 0.4 0.6 0.8  7 → 0.20.5 0.8 0.1 0.4 0.7 11 → 0.1 0.3 0.5 0.7 0.9 0.0 0.2 0.4 0.6 0.8

Note that the only difference between this table and Table 1 is apermutation of the entries in sub-tables for those moduli m_(i) forwhich the “inner-weights” are larger than unity (in this case w_(i)>1for the last two rows corresponding to moduli 7 and 11).

§4.2.3.C A Nontrivial Long Word-Length Example

Now consider the partial reconstruction of numbers with aword-length=256 bits. Here the range is R=2²⁵⁶.

In this case, first 44 prime-numbers are required to cover the entirerange. Therefore K=44 and

={2, 3, 5, 7, 11, . . . , 181, 191, 193}, m_(e)=4, and the product ofall the component-moduli is

=198962376391690981640415251545285153602734402721821058212203976095413910572270and the ratio

$\left( \frac{\mathcal{M}}{2^{256}} \right) \approx 1.718$

m_(k)=m₄₄=44 th prime number=193

The word-length required to represent

is

n=┌log ₁₀ (

)┐=┌77.298┐=78 decimal digits and the word length required for Z_(T) is  (60)

d _(T)=┌log₁₀(

×44)┐)=┌78.9┐=79 digits   (61)

Hence, any conventional full re-construction method to evaluate

_(C) requires at least a few operations on 79 digit-long integers.

In contrast, our new partial-reconstruction method requires adrastically smaller precision as well as a drastically small number ofsimple operations (only additions) to accurately evaluate thereconstruction coefficient

_(C). As per Result R1.2 above, the number of fractional digits requiredin the look-up-tables as well as in intermediate computations is only

w _(F)=┌log₁₀(44×2)┐=2 decimal fractional digits.   (62)

When addition of such fractions is considered, the integer part that canaccrue requires no more than 2 additional digits (since we are adding atmost K=44 values, and each is a proper fraction their sum must be lessthan 44 and therefore requires no more than 2 decimal digits to storethe integer-part).

Therefore, the total number of digits required in all intermediatecalculations is as small as 4, which is a drastic reduction from 79. (Ingeneral the reduction in precision is from O(lg

) required by conventional methods versus the much smaller amount O(lglg

) required by our method).

Another extremely important point: by appropriate scaling, all the fixedpoint fractional values in the table can be converted into integers.Correspondingly, all the fixed-point computations (additions andsubtractions of these fractions) are also scaled and can therefore berealized as integer-only operations.

The obvious scaling factor is 10^(w)F. The resulting look-up-table thatcontains the scaled integers as its entries is illustrated in Table 3.

TABLE 3 The Redsidue Addressed Table RAT for the RPPR algorithm for theRNS-ARDSP system with first 44 prime numbers as moduli, i.e.,

 = {2, 3, 5, 7, 11, . . . , 191, 193}. The fixed point truncated valuesof the fractions are scaled by a factor of C_(s) = 10². componentmodulus${Table}\mspace{14mu} {entries}\mspace{14mu} {for}\mspace{14mu} {row}\mspace{14mu} {m_{r}:\left. {{column}\mspace{14mu} i}\;\leftarrow\left\lfloor \frac{\left\lbrack {\left( {\left( {i \times w_{r}} \right)\mspace{11mu} {mod}\mspace{14mu} m_{r}} \right) \times C_{s}} \right\rbrack}{m_{r}} \right\rfloor \right.}$↓ m_(r) 1 2 . . . 189 190 191 192  2 → 50  3 → 66 33 . . . . . . . . . .. . 191 → 71 42 . . . 57 28 193 → 77 55 . . . 44 22

Note that un-scaling requires a division. But since the scaling factoris a power of the radix of the underlying number representation,un-scaling can be achieved simply by left-shift and truncation ofintegers. Thus, with the scaling, floating point computations areentirely avoided.

Next we formally specify the algorithm and simultaneously illustrate itfor two examples:

1. Example 1: find

_(C) for value X=641 in the small wordlength case. inputs: X=[3, 4, 1,2] (note that the fully reconstructed value for this touple viz., “641”is not known to the algorithm. It is only given the touple) and theextra-info value (X mod m_(e)=1));

2. Example 2: find

_(C) for the value “X=1” in the long=wordlength case. (inputs: X=[1, 1,. . . , 1, 1] and (X mod m_(e)=1));

Right below every step of the algorithm, the computations actuallyperformed for each of the two examples are also illustrated inside“comment-blocks”

§4.2.4 Specification of the Algorithm via Maple-Style Pseudo-Code

Algorithm Reduced_Precision_Partial_Reconstruction( Z, z_(e)) # Inputs:residue-touple Z = [z₁, z₂, . . . , z_(j)], extra-info z_(e) = (Z modm_(e)), m_(e) ε {2, 4} # Output: Exact value of the ReconstructionCoefficient 

_(c) # Pre-computation : moduli,

, m_(e), all constants (ex, 

_(j),w_(j) = ((1/ 

_(j)) mod m_(j)) mod m_(j), . . . )) # create (Reconstruction_Table(s) ); # Step 1: using z_(r) as the indexes, look up ultra low precisionestimates {circumflex over (f)}_(r), r = 1 . . . K #   Note that thiscan be done in parallel in all channels nz := 0; for i from 1 to K do  #for each channel i   if z_(r) = 0 then     {circumflex over (f)}_(r) :=0;     n_(z) _(r) := 1;   else     {circumflex over (f)} _(r) := z_(r)thelement in the look-up-table for m_(r);     n_(z) _(r) := 0;   f i;    #same as “end if” od;   # same as “end for” # Example 1: K = 4 valuesread from Table 1 above (and scaled by the factor C_(s) = 10) = [5, 1,2, 6] # Example 2: K = 44 pre-scaled values read from Table 2 above =[50, 61, . . . , 71, 77] # Step 2: Sum all the {circumflex over (f)}_(r)values with a total of only w_(T) digits of precision to obtain thebounds ${{\hat{S}}_{low}:={\sum\limits_{r = 1}^{K}\; {\hat{f}}_{r}}};$${n_{z}:={\sum\limits_{r = 1}^{K}\; n_{z_{r}}}};{and}$ Ŝ_(high) :=Ŝ_(low) + n_(z); # Example 1: Ŝ_(low) = 5 + 1 + 2 + 6 = 14 and Ŝ_(high)= 14 + 4 = 18 # Example 2: Ŝ_(low) = (50 + 61 + . . . + 71 + 77) = 2581and Ŝ_(high) = 2581 + 44 = 2625 # Step 3: unscale and take the floor ofthe bounds to obtain integer bounds on 

_(C) # note that these can be realized as a right-shift followed bytruncation${\hat{I}}_{low}:={\left\lfloor \frac{{\hat{S}}_{low}}{C_{s}} \right\rfloor \mspace{14mu} {and}}$${\hat{I}}_{high}:=\left\lfloor \frac{{\hat{S}}_{high}}{C_{s}} \right\rfloor$${\# \mspace{14mu} {Example}\mspace{14mu} 1\text{:}\mspace{14mu} {\hat{I}}_{low}} = {\left\lfloor \frac{14}{10} \right\rfloor = {1\mspace{14mu} {and}}}$${\hat{I}}_{high} = {\left\lfloor \frac{18}{10} \right\rfloor = 1}$${\# \mspace{14mu} {Example}\mspace{14mu} 2\text{:}\mspace{14mu} {\hat{I}}_{low}} = {\left\lfloor \frac{2589}{100} \right\rfloor = {25\mspace{14mu} {and}}}$${\hat{I}}_{high} = {\left\lfloor \frac{2625}{100} \right\rfloor = 26}$# Step 4: check if upper & lower integer bounds have same value. if yes,return it as the correct answer if (Î_(low) == Î_(high)) then   Return(Î_(low)); fi; # Example 1: both upper and lower bounds converge to thesame value 1

 correct value of 

_(C) = 1, and is returned # Example 2: Bounds do not converge to thesame value  

 need to disambiguate between {25, 26} using extra info # Step 5:disambiguate using extra-info if (Z_(T) mod m_(e) = {(Î_(low) ·

) mod m_(e) + Z mod m_(e)} mod m_(e)) then   Ans := Î_(low); else   Ans:= Î_(high); end if ; # Example 2: it can be verified that: (Z_(T) mod4) = 1 ≠ ((25 × 2) + 1) mod 4 = 3 mod 4 but # (Z_(T) mod 4) 

 ((26 × 2) + 1) mod 4 = 1 mod 4 Return (Ans) ;   # Output = correctvalue of  

_(C) End_Algorithm

The correctness of the algorithm can be proved by invoking Results 1, 2and other equations and identities presented in this document. Inaddition, the algorithm has been implemented in Maple (software) andexhaustively verified for small wordlengths (upto 16 bits). A largenumber of random cases (>10⁵) for long wordlengths (up to2²⁰≈million-bits) were also run and verified to yield the correctresult.

§4.2.5 RPPR Architecture

FIG. 3 illustrates the block diagram of an architecture to implement theRPPR algorithm. The main goal of the architecture is to fully leveragethe parallelism inherent in the RNS. There are K channels, each capableof performing all basic arithmetic operations, viz., {±, ×, division,shifts, powers, equality-check, comparison, . . . } modulo m_(r) whichis the component-modulus value for that particular channel.

In addition, each channel is also capable of accessing it's ownlook-up-table(s) (independent of other channels). Finally there is adedicated channel corresponding to the extra modulus m_(e)=2 or m_(e)=4that evaluates Z mod m_(e) for every non-primary integer Z (non-primaryrefers to a value that is not an external input or is not one of theprecomputed values).

We would like to emphasize that the schematic diagram is independent ofwhether the actual blocks in it are realized is in hardware or software.The parallelism inherent in the RNS is independent of whether it isrealized in h/w or s/w. This should be contrasted with some otherspeed-up techniques (such as rendering additions/subtractionsconstant-time by deploying redundant representations) that areapplicable only in hardware [40].

§4.2.6 Delay Models and Assumptions

In order to arrive at concrete estimates of delay, we assume a fullydedicated h/w implementation. Each channel has its own integer ALU thatcan perform all operations modulo any specified modulus. Among all thechannels, the K-th one that performs all operations modulo-m_(K)requires the maximum wordlength since m_(K) is the largestcomponent-modulus.

The maximum channel wordlength is: n _(K) =lg m _(K) ≈O(lg n)≈lgln

≈lglg

  (63)

Note that this is drastically smaller than the wordlength n_(c) requiredfor conventional binary representation, which is roughly O(n), thenumber of bits required to represent

.

In accordance with the literature, we make the following assumptionsabout delays of hardware modules

<A-1> A carry-look-ahead adder can add/subtract two operands within adelay that is logarithmic w.r.t. the wordlength(s) of the operands.

<A-2> Likewise, a fast hardware multiplier (which is essentially a fastmulti-operand accumulation tree followed by a fast carry-lookahead-adderand therefore) also requires a delay that is logarithmic w.r.t. thewordlength of the operands.

More generally, a fast-multi-operand addition of K numbers each of whichis n-bits long requires a delay of

O(lgK)+O(lg(n+lgK)) which becomes ≈O(lgn) in our case.   (64)

<A-3> Assuming that the address-decoder is implemented in the form of a“tree-decoder”, a look-up table with

entries requires ≈O(lg

) delay to access any of it's entries.

<A-4> We assume that dedicated shifter(s) is(are) available. Amulti-stage-shifter (also known as a “bar-rel” shifter [1, 6[)implements shift(s) of arbitrary (i.e., variable) number of bit/digitpositions, where the delay is ≈O(lg(maximum_shift_distance_in_digits))units.

§4.2.7 Estimation of the Total Delay

The preceding assumptions, together with Equation (63) imply that thedelay Δ_(CH) all operations within individual channels can beapproximated to be

Δ_(CH) =lg m _(K) ≈O(lglgn)≈lglglg

  (65)

which very small.

The delay estimation is summarized in Table 4.

TABLE 4 ESTIMATION OF THE DELAY REQUIRED BY THE RPPR ALGORITHM AlgorithmStep no: can individual Approximate Delay and operation(s) channels workas a function of performed in parallel? wordlength n Justification 1:Compute or look-up ρ_(r) values yes O(lg lg n) Equation (65) 2: Usingρ_(r) as the index yes O(lg lg n) Equation (65) look up estimates{circumflex over (f)}_(r) 3: Add all the estimates No O(lg K) ≈Assumption <A-2> O(lg n) and Equation (64) 4: Un-scale the sum back NoO(lg lg n) realized via a shift and and truncate truncation 5: Check ifupper and lower bounds No O(1) obvious, equality check converge to thesame value on small values 6: Disambiguation No O(lg lg n) m_(e) ε {2,4}

tiny operands Overall delay ≡ Latency O(lg n) dominant “functional”component

As seen in the table, the dominant delay is in Step 3, the accumulationof values read from the per-channel look-up tables.

Therefore, the overall delay is ≈O(lgn) ≈O(lglg

)

§4.2.8 Memory Required by the PR Algorithm

The r-th channel associated with modulus m_(r) has its own Look-up tablewith (m_(r)−1) entries (since the case where the remainder is “0” neednot be stored). Hence, the number of storage locations needed is

$\begin{matrix}{{\sum\limits_{r = 1}^{K}\left( {m_{r} - 1} \right)} < {K \cdot m_{K}} \approx {O\left( K^{2} \right)} \approx {O\left( n^{2} \right)}} & (66)\end{matrix}$

Each location stores a fractional value that is no longer than w_(F)digits ≈O(lgn) bits. Therefore, total storage (in bits)=O(n²)(locations)×O(lgn) (bits per locations) ≈O(n²lgn) bits (67)

There are several important points to note:

1

Although the above estimate makes it look as-though it is a single chunkof memory, in reality it is not. Realize that each channel has its ownmemory that is independently accessible. The implications are:

2

The address-selector (aka the decoder) circuitry is substantiallysmaller and therefore faster (than if the memory were to be one singleblock).

3

It is a READ-ONLY memory, the precomputed values are to be loaded onlyonce, they never need to be written again. In a dedicated VLSIimplementation, it would therefore be possible to utilize highlyoptimized, smaller and faster cells.

§4.3 Base Change/Extension

Those familiar with the art will realize that once the “RPPR” algorithmyields an exact equality for the operand (being partiallyre-constructed), a base-extension or change is straightforward.

Without loss of generality, the algorithm is illustrated via an examplewhich extends a randomly generated 32-bit long unsigned integer to a64-bit integer (without changing the value), which requires an extensionof the residue touple as shown below.

In order to cover the (single-precision) range ┌0, 2³²┐, the moduli setrequired is

=

₃₂={2,3,5,7,11,13,17,19,23,29} so that K=|

|=10, and total product

=6469693230, the reconstruction weights are

ℳ i = m i ,

i=1, . . . , 10 =[3234846615, 2156564410, 1293938646, 924241890,588153930, 497668710, 380570190, 340510170, 281291010, 223092870] andthe inner weights are

w i = ( 1 i )  mod   m i ,

i =1, . . . , 10 =[1,1,1,3,1,11,4,9,11,12]

In order to cover the double-precision range [0, 2⁶⁴], the extendedmoduli set required is

_(ext)=

₆₄={2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53} so that K_(ext)=

_(ext)|=16, and total product

_(ext)=32589158477190044730

Let the single precision operand be Z=1355576195≡Z=[1,2,0,1,6,12,3,10,10,21] and z_(e)=Z mod 4=3. The CRT expresses thevalue Z in the form

Z=(3234846615×p ₁+2156564410×p ₂+ . . . +223092870×p ₁₀)−6469693230

_(C) _(z)   (68)

where, p_(i)=(z_(i)×w_(i)) mod m_(i) is the reconstruction remainder forchannel i, for i=1, . . . , 10 and

_(C) _(z) =the reconstruction coefficient for Z

The only unknown in Eqn (68) is

_(C) _(z) which can be determined using the “RPPR” algorithm yielding anexact integer equality, enabling a straightforward evaluation of theextra-residues needed to extend the residue touple. For example thefirst extra-modulus in the example at hand is m₁₁=31. Accordingly, Z_(ext) [11]=z ₁₁=(Z mod 31)

Note that (3234846615 mod 31), . . . , (6469693230 mod 31) are allconstants for a given RNS and can be pre-calculated and stored. Ingeneral we always assume that whichever values can be pre-computed areactually pre-computed. Thus

θ_((i,j ))=

_(i) mod m _(ej) for i=1, . . . , K and j=K+1, . . . , K _(ext)   (69)

are all pre-computed and stored, so that the operation of evaluating theremainder w.r.t. an extra modulus m_(e) _(r) in the extended RNS system(such as the modulus 31 in the running example at hand) simplifies to

Z mod m _(e) _(r) =(θ_((1,e) _(r) ₎ p ₁+ƒ_((2,e) _(r) ₎ p ₂+ . . .+θ_((K,e) _(r) )p _(K)−Δ_(e) _(r)

_(C) _(z) )mod m _(e) _(r)   (70)

where, Δ_(e) _(r) =

mod m_(e) _(r)

Next, we specify the algorithm exactly in Maple-style pseudo-code.

§4.3.1 Specification of the Algorithm via Maple-Style Pseudo-Code

Algorithm Base_extension_using_RPPR_method( Z,z_(e))/*  Inputs : residue-touple Z = [z₁,z₂,...,z_(K)], extra-info z_(e) =(Z mod m_(e)), m_(e) = 4   corresponding to a 32 bit unsigned integerZ  */ # Output:  residue touple for the 64-bit extension of Z/*  Pre-computation : original moduli-set

₃₂ = {2,3,...,29}, |

₃₂| = K₃₂ = 10   extended moduli-set 

₆₄ = {2,3,...,53}, |

₆₄| = K₆₄ = 16   all constants (ex, 

_(j),w_(j) = ( (1/

_(j)) mod m_(j),...))   Reconstruction_Table(s) for both

₃₂ and

₆₄, etc.  */ # Step 1:  evaluate 

_(C) _(z ) using the “RPPR” algorithm

_(C) _(z  ) = Reduced_Precision_Partial_Reconstruction( Z,z_(e)) ; #Step 2:  evaluate the extra residues as per Eqns (68) and (70) # Thiscan be executed in parallel in all channels corresponding to theextension moduli for i from 1 to K_(e) do  # in each extension channel i  sum := 0;   for j from 1 to K_(e) do    sum := sum + ρ_(j) × θ_(i,j) ;  od;   z_(i) := (sum − Δ_(i )

_(C) _(z ) ) mod m_(i) ;    Z:= concat( Z, z_(i)) ;    # append the newresidue to the   residue touple od ; Return ( Z)   □ End_Algorithm

§4.4 Sign and Overflow Detection by Interval Separation (SODIS)

The next canonical operation I have sped-up is sign-detection. The newalgorithms for sign-detection in the RNS are illustrated in thissection.

As illustrated in FIG. 4, fundamentally, the RNS representation doesallow a separation of positive and negative integers into distinctnon-overlapping regions (what is meant here is that the RNS mapping isnot so strange as to “mix” positive and negative numbers throughout theentire range. It takes an “interval” (namely the interval including allnegative integers) and faithfully (i.e., without changing the length ofthe interval) simply translates (or displaces) it into another another“interval” which is not surprising since the “mapping” corresponding tothe translation is the simple first degree equation describing the“modulo” operation, i.e., Eqn. (3))

Q: Where then is the problem in sign detection?

A: (i) Note that a “re-construction” of the overall magnitude isnecessary

-   -   (ii) For the efficiency of representation (i.e., in order not to        waste capacity) the following additional constraint is also        imposed in most RNS implementations.

F _(max) ⁻ =F _(max) ⁺+1   (71)

In other words, ALL the unsigned integers in the range [0,

−1] are utilized, not a single digit value is wasted.

An unfortunate by-product of this quest for representational efficiencyis that consecutive integers F_(max) ⁺ and F_(max) ⁻ end up havingopposite signs. Consequently, re-construction must be able todistinguish between consecutive integers, i.e., the resolution of there-construction must be full, i.e., in fractional computations

ulp < 1 ( 72 )

This, in-turn requires that all fractional computations must be carriedout to the full precision, thereby rendering them slow.

The main question therefore is whether it is possible to make do withthe drastically reduced precision we wish to deploy? and if so, thenhow?

The answer is to insert a sufficiently large “separation-zone” betweenthe positive and negative regions, as illustrated in FIG. 5.

In FIG. 5, note that

both “0” as well as

_(e) correspond to the actual magnitude 0   (73)

unsigned integers {1, . . . , F_(max) ⁺} represent +ve values and   (74)

unsigned integers {F_(max) ⁻, . . . , (

_(e)−1)} represent −ve values {−(

_(e)−F_(max) ⁻), . . . , −1}, repsectively, wherein   (75)

Unsigned Integer F_(max) ⁺ represents the maximum positive magnitudeallowed, and   (76)

Unsigned Integer F _(max) ⁻ represents the max. −ve magnitude allowed=(

_(e) −F _(max) ⁻)   (77)

The interval between F_(max) ⁺ and F_(max) ⁻ is the separation zone  (78)

Most practical/useful number representations try to equalize the numberof +ve values and the number of −ve values included in order to attainmaximal amount of symmetry. This yields the constraint

F _(max) ⁻≈

_(e) −F _(max) ⁺  (79)

Intuitively, it is clear that equal lengths should be allocated to

(1) the +ve interval

(2) the −ve interval and

(3) the separation-zone between the opposite polarity intervals.

For this to be possible, the extended modulus

_(e) must satisfy

_(e)>3·F _(max) ⁺  (80)

Finally, the attainment of maximal possible symmetry dictates that tothe extent possible, the separation-zone must be symmetrically splitacross the mid-point of the range [0, (

_(e)−1)]. Note that FIG. 5 incorporates all these symmetries.

With the separation interval in place, note that all +ve numbers Z⁺within the range [1,F_(max) ⁺] when represented as fractions of thetotal magnitude

_(e) now satisfy

0 < Z + e < 1 3 ( 81 )

Likewise, all −ve numbers Z⁻ when represented as fractions of the totalmagnitude M_(e) satisfy

2 3 < Z - e < 1 ( 82 )

But Eqn (19) (repeated here for the sake of convenience) states that

Z = ( S - ⌊ S ⌋ ) = the   fractional   part   of   the   sum  of   fractions ≈ ∑ r = 1 K  f r ^ ( 83 )

In other words, the separation interval enables the evaluation of thesign of the operand under consideration by examining one (or at mosttwo) most significant digits of the accumulated sum of fractions.

Recall that in the partial reconstruction, the integer part of thesum-of-fractions was of crucial importance.

It is quite striking that when the interval separation is properlyleveraged as illustrated herein, the most significant digit(s) of thefractional part also convey equally valuable information, viz., the signof the operand.

As per Equations (81) and (82) the natural choice of the detectionboundaries T⁺ and T⁻ is specified by the relations

T + e = 1 3 = 0.333   …   and ( 84 ) T - e = 2 3 = 0.666   … ( 85)

However, note that even if the “detection boundaries” T⁺(for +venumbers) and T⁻(for −ve numbers) are moved slightly into the “separationzone”, as illustrated in FIG. 5, the sign-detection outcome does notchange.

For the ease of computation, we therefore set

T + e = 4 10 = 0.4   and ( 86 ) T - e = 6 10 = 0.6 ( 87 )

§4.4.1 Specification of Sign-Detection Algorithm via Maple-StylePseudo-Code

Algorithm Eval_Sign ( Z, z_(e)) # Input(s): Given an integer Zrepresented by the residue touple/vector Z := [z₁, . . . , z_(K)] # andone extra value, viz., z_(e) = (Z mod m_(e)) where m_(e) = 4 /*Output(s): the Sign of Z, defined as $\begin{matrix}{{{Sign}(Z)} = \left\{ \begin{matrix}0 & {{{if}\mspace{14mu} Z} = 0} \\{+ 1} & {{{if}\mspace{14mu} Z} > 0} \\{- 1} & {otherwise}\end{matrix} \right.} & (88)\end{matrix}$ The algorithm also returns two more values in addition tothe sign (i) the value of the reconstruction coefficient 

_(C) for the input and (ii) Approx_overflow_estimate which is a flagdefined as follows: $\begin{matrix}{{{Approx\_ overflow}{\_ estimate}} = \left\{ \begin{matrix}1 & {{if}\mspace{14mu} {overflow}\mspace{14mu} {is}\mspace{14mu} {detected}\mspace{14mu} {for}\mspace{14mu} {sure}} \\0 & {otherwise}\end{matrix} \right.} & (89)\end{matrix}$ further computation is needed to determine whether thereis an overflow in the 2nd case above Pre-computation: Everything neededfor the “RPPR” algorithm, and in addition F_(max) ⁺, F_(max) ⁻, decisionboundaries T⁺ and T⁻, etc. */ # Step 1: Look up pre-stored estimatesf_(r), r = 1 . . . K for i from 1 to K do # for each channel i   ifz_(r) = 0 then     {circumflex over (f)}_(r) := 0;     n_(z) _(r) := 1;  else     {circumflex over (f )}_(r) := z_(r)-th element in thelook-up-table for m_(r);     n_(z) _(r) := 0;   end if od; # Step 2: Sumall the f_(r) values with only w_(T) total digits${{\hat{S}}_{low}:={\sum\limits_{r = 1}^{K}\; {\hat{f}}_{r}}};$${n_{z}:={\sum\limits_{r = 1}^{K}\; n_{z_{r}}}};{{{and}\mspace{14mu} {\hat{S}}_{high}}:={{\hat{S}}_{low} + n_{z}}};$if (n_(z) == K)   then # all components = 0 

 Z = 0   Return(0, 0, 0); fi; # Step 3: unscale and separate integer andfractional parts${{\hat{I}}_{low}:=\left\lfloor \frac{{\hat{S}}_{low}}{C_{s}} \right\rfloor};{{\hat{F}}_{low}:=\left( {{\hat{S}}_{low} - {\hat{I}}_{low}} \right)};{and}$${{\hat{I}}_{high}:=\left\lfloor \frac{{\hat{S}}_{high}}{C_{s}} \right\rfloor};{{\hat{F}}_{high}:=\left( {{\hat{S}}_{high} - {\hat{I}}_{high}} \right)};$# important substitutions {circumflex over (F)} = {circumflex over(F)}_(low); and Î = Î_(low); # Step 4: determine the temporary signApprox_overflow_estimate := 0; if ({circumflex over (F)} < T⁺) then  Temp_Sign := +1; else if ({circumflex over (F)} > T⁻) then   Temp_Sign:= −1; else   Approx_overflow_estimate := 1;   if ({circumflex over (F)}< 1/2) then     Temp_Sign := +1;   else     Temp_Sign := −1;   end if;end if; # Step 5: determine

_(C) if (Î_(high) = Î_(low)) then   

_(C) := Î else   if (Z_(T) mod 4 = {(Î ·

) mod 4 + Z mod 4} mod 4) then     

_(C) := Î;   else     

_(C) := Î + 1;   fi; fi; if (

_(C) , = Î) then   Sign := Temp_Sign; else   Sign := (−1) × Temp_Sign;fi; Return(Sign,

_(C), Approx_overflow_estimate) End_Algorithm

§ 4.4.2 Overflow Detection

Since we are dealing with sign-detection of integers, an underflow of“magnitude” simply results in the value “0”; no other action needs to betaken in case of magnitude underflow.

However, a magnitude overflow, must be detected and flagged. let A and Bbe the operands and let

denote some operation, then “overflow of magnitude” includes both cases

case 1: (A

B)>Posedge=F _(max) ⁺ and   (90)

case 2: (A

B)<Negedge=−(

−F _(max) ⁻) or dividing both sides by

  (91)

( A ⊗ B ) > F ma   x +   and ( 92 ) ( A ⊗ B ) < F ma   x - ( 93 )

However, recall that the decision boundaries T⁺ and T⁻ are shifted by asmall amount into the “separation region”. As a result, whenever inputvalues in the range [F_(max) ⁺,T⁺] or in the range [T, F_(max)] areencountered, they will be wrongly classified as being within the correctrange even though they are actually outside the designated range. Theonly solution to this problem is to separately evaluate the sign ofeither (Z−F_(max) ⁺) or (Z−F_(max) ⁻) to explicitly check for overflow.

§4.4.2.A Specification of Overflow Detection Algorithm via Maple-StylePseudo-Code

Algorithm Eval_overflow( Z, z_(e), Sign, approx_overflow) # Note thatevery invocation of this algorithm must be immediately preceded by # aninvocation of the Eval_sign algorithm /* Precomputations: same as thosefor algorithm Eval_sign Inputs: Z, z_(e), Sign of Z, approx_overflow forZ The last two values are obtained as a result of the execution of theEval_sign algorithm immediately preceding the invocation of thisalgorithm. Output(s): overflow flag defined as $\begin{matrix}{{overflow} = \left\{ \begin{matrix}1 & {{if}\mspace{14mu} {overflow}\mspace{14mu} {is}\mspace{14mu} {detected}\mspace{14mu} {for}\mspace{14mu} {sure}} \\0 & {{no}\mspace{14mu} {overflow}}\end{matrix} \right.} & (94)\end{matrix}$   */ # Step 1: handle the trivial cases first if (Sign ==0) then   Return(0); fi; if (approx_overflow == 1) then   Return(1); fi;# Step 2: Determine the argument “TZ” for auxiliary sign-detection.#  note that the residue touple TZ corresponding to the integer TZ isdirectly determined #  via component-wise subtractions in the ResidueDomain if (Sign == +1) then    TZ := Z ⊖ F_(max) ⁺ ;   # ⊖ denotescomponent-wise subtraction in the residue domain tz_(e) := (z_(e) −(F_(max) ⁺mod 4)) mod 4;    # disambiguation-bootstrapping else if (Sign== −1) then    TZ := Z ⊖ F_(max) ⁻ ;   tz_(e) := (z_(e) − (F_(max) ⁺mod4)) mod 4;  # keep track of all values modulo 4 end if ; # Step 3:determine the sign of TZ, (which is denoted by the variable S_(tz)herein S_(tz), tmp_rc, approx_overflow_tz := Eval_sign( TZ, tz_(e)); if(Sign == +1) then   if (S_(tz) == +1) then     overflow := 1;   else    Overflow := 0;   end if; else if (Sign == −1) then   if (S_(tz) ==−1) then     Overflow := 1;   else     Overflow := 0;   end if; end if;Return (overflow); End_Algorithm

Once these building blocks are specified, the overall SODIS algorithm isspecified next.

§4.4.2.B Specification of Sign and Overflow Detection Algorithm viaMaple-Style Pseudo-Code

Algorithm  Sign_and_Overflow_Detection_by_Interval_Sepation( Z, z_(e)) #this algorithm is abbreviated as “SODIS” # Inputs:  Z, z_(e) #Outputs: Sign (Z), overflow,  

_(C) _(z) Sign,  

R_(C) _(z ) , approx_overflow := Eval_sign ( Z, z_(e)) ; overflow :=Eval_overflow( Z, z_(e), Sign, approx_overflow) ;Return(Sign, overflow,  

_(C) _(z ) ) ;   □ End_Algorithm

Those familiar with the art shall realize that using the algorithmspresented in this section, a comparison of of two numbers say A and Bcan be realized extremely fast, without ever leaving the residue domainin a straightforward manner by detecting the sign of (A-B).

§4.5 The Quotient First Scaling (QFS) Algorithm for Dividing by aConstant

Assume that a double-length (i.e., 2n-bit) dividend X is to be dividedby an n-bit divisor D, which is a constant, i.e., it is known ahead oftime. The double length value X is variable/dynamic. It is either anexternal input or more typically it the result of a squaring or amultiplication of two n-bit integers. It is assumed that the extra-bitof information, i.e., the value of (X mod m_(e)) is available. Givenpositive integers X and D, a division entails computing the quotient Qand a remainder R such that

$\begin{matrix}{X = {{{{Q \times D} + {R\mspace{14mu} {where}\mspace{14mu} 0}}R < {D\mspace{14mu} {so}\mspace{14mu} {that}\mspace{14mu} Q}} = \left\lfloor \frac{X}{D} \right\rfloor}} & (95)\end{matrix}$

To derive the Division algorithm, start with the alternative form of theChinese Remainder Theorem (CRT) which expresses the target integer viaan exact integer equality of the form illustrated in Equations (B-8.*).Express the double-length Dividend X as

X = X T - · C x = ( ∑ r = 1 K  ℳ r · ρ r ) - · C x ( 96 )

where the exact value of Reconstruction Coefficient

_(C) _(x) is determined using the “RPPR” algorithm explained in Section4.2 above. In other words, there is no unknown in the above exactinteger equality expressing the value of the dividend X.

To implement Division, evaluate the quotient Q as follows:

 Q =  ⌊ X D ⌋ =  ⌊ ∑ r = 1 K  ℳ r  · ρ r D - C x · D ⌋ =  ⌊ { ∑ r= 1 K  ( Q r + R r D ) } - ( Q RC + R RC D ) ⌋ ( 97 )  = ⌊ { ∑ r = 1 K ( Q r + f r ) } - ( Q RC + f RC ) ⌋   where ( 98 )  Q r , Q RC = ⌊ℳ r · ρ r D ⌋ ,   ⌊ C x · D ⌋ = precomputed   quotient   values ,and ( 99 )  R r , R RC = ( ℳ r · ρ r - Q r · D ) ,   ( C x · - Q RC ·D ) = precomputed   remainders , and ( 100 )  f r , f RC = R r D , R RC D = remainders   expressed   as   fractions   of   the  divisor   D ( 101 )

The exact integer-quotient can be written as

$\begin{matrix}{Q = {{\left( {\sum\limits_{r = 1}^{K}Q_{r}} \right) - Q_{RC} + \left\lfloor {\left( {\sum\limits_{r = 1}^{K}f_{r}} \right) - f_{RC}} \right\rfloor} = {Q_{I} + {Q_{f}\mspace{14mu} {where}}}}} & (102) \\{{Q_{I} = {{\left( {\sum\limits_{r = 1}^{K}Q_{r}} \right) - Q_{RC}} = {{the}\mspace{14mu} {contribution}\mspace{14mu} {of}\mspace{14mu} {Integer}\text{-}{part}}}},\left( {{hence}\mspace{14mu} {the}\mspace{14mu} {subscript}\mspace{14mu} {{}_{}^{}{}_{}^{}}} \right),{and}} & (103) \\{Q_{f} = {\left\lfloor {\left( {\sum\limits_{r = 1}^{K}f_{r}} \right) - f_{RC}} \right\rfloor = {{the}\mspace{14mu} {contribution}\mspace{14mu} {of}\mspace{14mu} {fractional}\text{-}{{part}\left( {{hence}\mspace{14mu} {the}\mspace{14mu} {subscript}\mspace{14mu} {{}_{}^{}{}_{}^{}}} \right)}}}} & (104)\end{matrix}$

Since exact values of Q_(r) and Q_(RC) are pre-computed and looked-up,the value of Q_(I) in Eqn (103) above is exact. However, since we useapproximate precomputed values of the fractions truncated to drasticallysmall precision, the value of Q_(f) calculated via Eqn (104) above isapproximate. As a result, the value of Q that is calculated is alsoapproximate. We indicate approximate estimates by a hat on top, whichyields the relations:

$\begin{matrix}{\hat{Q} = {Q_{I} + {\hat{Q_{f}}\mspace{14mu} {where}}}} & (105) \\{\hat{Q_{f}} = \left\lfloor {\left( {\sum\limits_{r = 1}^{K}\hat{f_{r}}} \right) - \hat{f_{RC}}} \right\rfloor} & (106)\end{matrix}$

Our selection of moduli (explained in detail in Section 4.1 above) leadsto the fact that the number of of memory-locations required for anexhaustive look-up turns out to be a small degree (quadratic) polynomialof n=lg

. This amount of memory can be easily integrated in h/w modules intoday's technology for word-lengths up to about 2¹⁷≈0.1-Million bits(which should cover all word-lengths of interest today as well as in theforeseeable future).

Note that the Reconstruction Coefficient

_(C) _(x) can also assume only a small number of values (no more than(K-1) where K is the number of moduli as per Eqns (B-4, B-5) and (B-9)Hence, quotient values Q_(RC) and the fractions

$f_{RC} = \left( \frac{R_{RC}}{D} \right)$

can also be pre-computed and stored for all possible values theReconstruction Coefficient

_(C) _(x) can assume.

§4.5.1 Further Novel Optimizations

{circle around (1)} Store the pre-computed Quotient values directly asresidue touples

Note that the quotient values themselves could be very large (about thesame word-length as the divisor D). However, we need not store theselong-strings of quotient values, since in many applications (such asmodular exponentiation) the quotient is only an intermediate variablerequired to calculate the remainder. Obviously the extra bit ofinformation conveyed by (Q_(rs) mod m_(e)) is also pre-computed andstored together with the touple representing the exact integer quotient

$\begin{matrix}{{Q_{ir} = {{\left\lfloor \frac{r \cdot \mathcal{M}_{i}}{D} \right\rfloor \mspace{14mu} {for}\mspace{14mu} i} = 1}},\ldots \mspace{14mu},{{K\mspace{14mu} {and}\mspace{14mu} r} = 1},\ldots \mspace{14mu},\left( {m_{r} - 1} \right)} & (107)\end{matrix}$

The total memory required to store either the full-length long-integervalue or storing the residues w.r.t. the component moduli as a touple isabout the same. By opting to store only the residue-touples, weeliminate the delay required to convert integer quotient values intoresidues, without impacting the memory requirements significantly.

-   -   {circle around (2)} Only fractional remainders truncated to        drastically reduced precision O(lgK)≈O(lglg        ) need to be pre-computed and stored (exactly similar to the        “RPPR” algorithm).    -   {circle around (3)} simple scaling converts all fractional        storage/computations into integer values.

Thus, the QFS algorithm needs 2 distinct Quotient_Tables.

§4.5.2 Quotient-Tables Explained via a Small Numerical Example

I believe that the tables can be best illustrated by a concrete-smallexample. Assume that the divisor D=209=11×19 (i.e. D is representable asan 8-bit number). The dividends of interest are therefore up-to 16-bitlong numbers. In this case the moduli turn out to be [2, 3, 5, 7, 11,13, 17]. Even if the first two moduli (viz., 2 and 3) are dropped, theproduct still exceeds the desired range [0, 2¹⁶]. Therefore we select

={5,7,11,13,17}

K=5,

=85085 and the extra-modulus m _(e)=2   (108)

To realize division by this divisor D=209, the first table required isshown in Table 5.

This table is referred to as “Quotient_Table_(—)1” (or also as the“Quotient_Touples_Table”). It stores all possible values of Quotientsrequired to evaluate the first term (the sum) in Eqn (103). The entries(rows) corresponding to each component-modulus m_(r) constitute asub-table of all possible values p_(r) can assume for that value ofm_(r). For the sake of clarity, we have used a “double-line” to separateone sub-table from the next.

To illustrate the pre-computations, we explain the last sub-table, inQuotient_Table_(—)1 corresponding to the component-modulus “m₅=17”,wherein,

ℳ 17 = 17 = 5005.

This sub-table has 16 rows. The first row corresponds to p₅=1 the secondrow corresponds to p₅=2 and so on. Now, we explain each entry in thepenultimate row in Table 1 above (this row corresponds to p₅=15).

The value in the 3rd column titled “Quotient . . . ” lists the quotient,i.e.

$\left\lfloor {\frac{\mathcal{M}_{17} \times 15}{D} = \frac{5005 \cdot 15}{209}} \right\rfloor = 359.$

The next entry (within the angled-brackets

) simply lists the value of (Q₅ _(—) ₁₅ mod m_(e))=(359 mod 2)=1. The4th column stores the residue-touple [4,2,7,8,2] representing thequotient 359.

The last column in Table 1 stores the fixed point fractional remaindervalues scaled by the multiplying factor b^(w) ^(f) =10² to convert theminto integers. For instance, in the penultimate row: the actualremainder is (5005×15−359×209)=44, corresponding fractional remainder is

$\frac{44}{209} \approx {0.21052\mspace{14mu} \ldots}$

which when truncated to two decimal places yields 0.21

Accordingly,

${{trunc}\left( {\frac{44}{209} \times 10^{2}} \right)} = 21$

and this is the value stored in the last column.

TABLE 5 Quotient_Table_1 for RNS-ARDSP with moduli  

 = [5, 7, 11, 13, 17] and divisor D = 209. In this case, two digitssuffice to store the scaled fractional-remainders in the last column.modulus m_(r) ↓ ρ_(r) = [1, 2, . . . m_(r) − 1] ↓ Quotient$Q_{r} = \left\lfloor \frac{M_{r} \cdot \rho_{r}}{D} \right\rfloor$ and

Q_(r) mod 2 

moduli m_(j), j = 1 . . . K [5, 7, 11, 13, 17] Q_(r) mod m_(j), j = 1 .. . K Remainder R_(r) = M_(r) · ρ_(r) − Q_(r) · D Scaled Fractional Rem=$\left. {{trunc}\left( {\frac{R_{r}}{D} \times 10^{w_{f}}} \right)}\mspace{14mu}\downarrow \right.$5 1  81 

1 

[1, 4, 4, 3, 13] 42 2 162 

0 

[2, 1, 8, 6, 9] 84 3 244 

0 

[4, 6, 2, 10, 6] 26 4 325 

1 

[0, 3, 6, 0, 2] 68 . . . . . . . . . 17 1  23 

1 

[3, 2, 1, 10, 6] 94 2  47 

1 

[2, 5, 3, 8, 13] 89 . . . 15 359 

1 

[4, 2, 7, 8, 2] 21 16 383 

1 

[3, 5, 9, 6, 9] 15

We would like to point out that the actual full-wordlength-long integervalues of quotients Q_(r) (that are listed in column 3 in the table)need not be (and hence are not) stored in a real (h/w or s/w)implementation of the algorithm (the full decimal Q_(r) values wereincluded in column 3 in Table 1 above, merely for the sake ofillustration). In an actual implementation, only the extra-information,i.e.,

Q_(r) mod 2

values (shown inside the angled-braces

in column3) and the residue-domain touples representing Q_(r) (as shownin column 4 in the table) are stored. For example, in the penultimaterow, actual quotient value “359” need not be stored, only

359 mod 2

=

1

would be stored, together with the touple of residues of 359 w.r.t. thecomponent moduli=[359 mod 5, . . . , 359 mod 17]=[4, 2, 7, 8, 2] asshown in column 4 therein.

Next we explain Table 6, which shows Quotient_Table_(—)2 (also referredto as the “Quotient_Rc_Table”)

TABLE 6 Quotient_Table_2 for RNS-ARDSP with moduli [5,7,11, 13, 17] anddivisor D = 209.

_(C) = [1, 2, . . . K] ↓ Quotient$Q_{c} = \left\lfloor \frac{M \cdot R_{C}}{D} \right\rfloor$

Q_(c) mod 2 

moduli m_(j), j = 1 . . . K [5, 7, 11, 13, 17] Q_(c) mod m_(j), j = 1 .. . K Remainder R_(c) = 

 ·

_(C) − Q_(c) · D Scaled Fractional Remainder${{Scaled}\mspace{14mu} {Fractional}\mspace{14mu} {Rem}} = \left. {{ceil}\left( {\frac{R_{c}}{D} \times 10^{w_{f}}} \right)}\mspace{14mu}\downarrow \right.$1  407 

1 

[2, 1, 0, 4, 16] 11 2  814 

0 

[4, 2, 0, 8, 15] 22 3 1221 

1 

[1, 3, 0, 12, 14] 32 4 1628 

0 

[3, 4, 0, 3, 13] 43 5 2035 

1 

[0, 5, 0, 7, 12] 53

This table covers all possible values of the Reconstruction Coefficient

_(C) _(x) in Eqns (96)-(98). Like Table 5, the values in column 2 (i.e.,the full-wordlength-long integer values of quotient Q_(c)) are notstored in actual implementation, (they are included in the table onlyfor the sake of illustration). In actual implementations, only theresidues of Q_(c) with respect to (w.r.t.) 2 (shown inside angled braces

in column 2) and the touple of residues of Q_(c) w.r.t. thecomponent-moduli are stored as illustrated in the third column of theTable. The last column stores the fixed point fractional remaindervalues scaled by the factor 10^(w) ^(f) to convert them into integers.

Another nontrivial distinction of Quotient_Table_(—)2 from all previoustables is the fact that the fractional values in the last column arealways rounded-up (the mathematical expression uses the “ceiling”function). Note that the last term in Equations (96) and (98), has anegative sign. As a result, when rounding the fractional remainders, wemust “over-estimate” them, so that when this value is subtracted toobtain the final quotient estimate, we never over-estimate. In otherwords, the use of “ceiling” function is necessary to ensure that we arealways “under-estimating” the total quotient.

§4.5.3 Specification (Pseudo-Code) of the QFS Algorithm

Like the RPPR-algorithm, we illustrate the division algorithm with 2examples:

(i) first with small sized operands (dividend X=3249, divisor D=209) sothat the reader can replicate the calculations by hand/calculator ifneeded.

(ii) The 2nd numerical example is a realistic long-wordlength case.

Instead of separating the pseudo-code and numerical illustration, wehave waved in the numerical illustration of each step of the algorithmfor the running (small) example at hand by including the numericalcalculations into the pseudo code as comment blocks.

Algorithm Quotient_First_Scaling_Estimate ( X,

X mod m_(e))

) # Inputs: Dividend X as a residue-touple ( X = [x₁, . . . , x_(K)] and

X mod m_(e)

), where m_(e) ε {2, 4} # Pre-computations: Moduli, extra_modulus m_(e),all constants

,M_(r), w_(r), r = 1, 2, . . . , K etc. create (Reconstruction_Tables);create (Quotient_Tables); # Step 1: use the RPPR-algorithm to find theReconstruction-(Remainders & Coefficient) for X (1.1) [ρ₁, . . .,ρ_(K)], 

_(C) _(x) := RPPR ( X, m_(e),

X mod m_(e)

(1.2) nonzero_rrems := 0; (1.3) for i from 1 to Nmoduli do       ifρ_(i) ≠ 0 then nonzero_rrems := nonzero_rrems + 1; fi; od; (1.4) if (

_(C) _(x) ≠ 0) then   nonzero_rcx := 1; else   nonzero_rcx := 0; fi; #In the numerical example: ρ := [10, 2, 2, 5, 2];

_(C) _(x) := 2; nonzero_rrems := 5; nonzero_rcx := 1; # Step 2: Usingthe ρ_(i) and the the

_(C) _(x) values as “indexes”, look-up in parallel the touples T_(i)[ρ_(i)], # scaled remainders Rr_(i), QRcx, RRcx, and thecorresponding extra_info values (all added in | |) (2.1)  T _(i) :=Quo_Tab_1 (i, ρ_(i), 3]; Rr_(i) := Quo_Tab_1[i, ρ_(i), 4]; i = 1, . . ., K (2.2 )  QRcx := Quo_Tab_2[

_(C) _(x) , 3]; RRcx := Quo_Tab_2[rcx, 4]; (2.3)  extra_info_T_(i) :=Quo_Tab_1[i, ρ_(i), 2]; extra_info_QRcx := Quo_Tab_2[

_(C) _(x) , 2]; /* In the example: T ₁ := [4, 1, 8, 5, 1], T ₂ := [2, 6,7, 10, 11], T ₃:= [4, 4, 8, 9, 6], T ₄ := [0, 3, 4, 4, 1], T ₅ := [2, 1,8, 6, 9] and QRcx = [4, 2, 0, 8, 15]. (Rr₁, Rr₂, Rr₃, Rr₄, Rr₅, Rrcx) :=(47 63, 1, 78, 84, 22) Note that there is no “extra-information”associated with the fractional-remainder values  */ # Step 3: Executeaccumulations in parallel in all the RNS and extra channel(s)$\begin{matrix}{{{(3.1)\mspace{14mu} {\overset{\_}{Q}}_{I}}:={\left( {\begin{matrix}K \\ + \\{i = 1}\end{matrix}{\overset{\_}{T}}_{i}} \right)\mspace{14mu} {- \mspace{14mu} \overset{\_}{QRcx}}}};} \\{{{{(3.2)\mspace{20mu} {\hat{Q}}_{f}} = {\left( {\sum\limits_{j = 1}^{K}\; {Rr}_{j}} \right) - {RRcx}}};}\quad}\end{matrix}$/*  ±  denotes  a  component-wise  addition/subtraction  of  touples  in  parallel  in  all  the  RNS  channels.    “Σ” denotes addition of scalars in the extra channel(s) In theexample:${\overset{\_}{Q}}_{I} = {{\left\lbrack {4,1,8,5,1} \right\rbrack \; {+ \; \left\lbrack {2,6,7,10,11} \right\rbrack \; {+ \; \left\lbrack {4,4,8,9,6} \right\rbrack \; {+ \; \left\lbrack {0,3,4,4,1} \right\rbrack \; {+ \; \left\lbrack {2,1,8,6,9} \right\rbrack \; {- \; \left\lbrack {4,5,0,8,15} \right\rbrack}}}}}} = {\left\lbrack {\left( {8\mspace{14mu} {mod}\mspace{14mu} d} \right),\ldots \mspace{11mu},\left( {13\mspace{14mu} {mod}\mspace{14mu} 17} \right)} \right\rbrack = {{\left\lbrack {3,6,2,0,13} \right\rbrack \mspace{14mu} {and}\mspace{14mu} {\hat{Q}}_{f}} = {{\left( {47 + 63 + 1 + 78 + 84} \right) - 22} = {251\mspace{175mu}*\text{/}}}}}}$# Step 4: Set {circumflex over (Q)}f_unscaled := Unscaled {circumflexover (Q)}_(f). Also evaluate bounds on {circumflex over (Q)}_(f), andcheck if {circumflex over (Q)}_(f) is exact. $\begin{matrix}{{{(4.1)\mspace{14mu} \hat{Q}{f\_ unscaled}}:=\left\lfloor \frac{\left( {\hat{Q}}_{f} \right)}{b^{w}} \right\rfloor};} \\{{{(4.2)\mspace{14mu} {\hat{Q}}_{f\_ {high}}}:=\left\lfloor \frac{\left( {{\hat{Q}}_{f} + {nonzero\_ rrems} + {nonzero\_ rcx}} \right)}{b^{w}} \right\rfloor};}\end{matrix}\quad$ # here, b is the base and w is the precision 

 only left-shift followed by truncation suffices (4.3) {circumflex over(Q)}_(f)_ low := {circumflex over (Q)}f_unscaled (4.4)       fi;   if({circumflex over (Q)}_(f)_low = {circumflex over (Q)}_(f)_high) then    Q_is_exact := 1;   else   Q_is_exact := 0; $\quad\begin{matrix}{{\pounds \mspace{14mu} {In}\mspace{14mu} {the}\mspace{14mu} {example}\text{:}}\quad} \\{{{{\pounds \mspace{14mu} {\hat{Q}}_{f\_ {lo}w}}:={\left\lfloor \frac{251}{10^{2}} \right\rfloor = 2}};\mspace{14mu} {and}}} \\\left. {{{\pounds \mspace{11mu} {\hat{Q}}_{f\_ {hig}h}}:={\left\lfloor \frac{251 + 5 + 1}{10^{2}} \right\rfloor:=2}};}\Rightarrow \right. \\{{\pounds \mspace{14mu} {in}\mspace{14mu} {the}\mspace{14mu} {example}},\; {{{{Q\_ is}{\_ exact}}:=1};}}\end{matrix}$${{\# \mspace{14mu} {Step}\mspace{11mu} 5\text{:}{evaluate}\mspace{14mu} \overset{\_}{Q}\text{:}{convert}\mspace{14mu} \hat{Q}{f\_ unscaled}\mspace{14mu} {into}\mspace{14mu} a\mspace{14mu} {residue}\text{-}{touple}\mspace{14mu} {and}\mspace{14mu} {add}\mspace{14mu} {it}\mspace{14mu} {to}\mspace{14mu} {\overset{\_}{Q_{I}}(5.1)}\mspace{14mu} {\overset{\_}{Q}}_{f\_ touple}}:={{vector}\mspace{14mu} \left( {K,\left. i\rightarrow\left( {\hat{Q}{f\_ unscaled}\mspace{14mu} {mod}\mspace{14mu} m_{i}} \right) \right.} \right)}};$${(5.2)\mspace{14mu} \overset{\_}{Q}}:={{\overset{\_}{Q_{I}}\mspace{11mu} {+ \; {\overset{\_}{Q}}_{{f\_}{toupl}e}}\mspace{31mu} \# \mspace{14mu} {In}\mspace{14mu} {the}\mspace{14mu} {example}\text{:}\overset{\_}{Q}}:={{\left\lbrack {3,6,2,0,13} \right\rbrack \mspace{11mu} {+ \; \left\lbrack {2,2,2,2,2} \right\rbrack}} = {\left\lbrack {0,1,4,2,15} \right\rbrack \;\quad}}}$# Step 6: Also generate {circumflex over (Q)} mod m_(e); the“disambiguation-bootstrapping” step${{(6.1)\mspace{14mu} {extra\_ info}\_ \hat{Q}}:={\left\lbrack {\left( {\sum\limits_{i = 1}^{K}\; {{extra\_ info}{\_ T}_{i}}} \right) - {{extra\_ info}{\_ QRcx}}} \right\rbrack {mod}{\mspace{11mu} \;}m_{e}}};$(6.2) extra_info_{circumflex over (Q)} := (extra_info_{circumflex over(Q)} + {circumflex over (Q)}f_unscaled) mod m_(e) # in the example:extra_info_{circumflex over (Q)} := ((1 + 0 + 0 + 0 + 0 − 0) mod 2 + 2)mod 2 = 1 (7) Output: Return ( {circumflex over ( Q)} , ({circumflexover (Q)} mod m_(e)), Q_is_exact); End_Algorithm

/* In the example: Using the CRT, it can be verified that {circumflexover (Q)}=[0, 1,4, 2,15]≡15 and ({circumflex over (Q)} mod m_(e))

1.

It is also easy to independently check that

${\left\lfloor \frac{3249}{209} \right\rfloor = 15},$

verifying the returned value of flag Q_is_exact */ We would like toclarify some important issues regarding the QFS algorithm.

-   -   {circle around (1)} From the residue touple {circumflex over        (Q)}, returned by the algorithm, the remainder can be directly        estimated as a residue-touple; and the extra info value        ({circumflex over (R)} mod m_(e)) can also be evaluated using        the fundamental division relation (Eqn (95) above):

{circumflex over (R)}:= X

( {circumflex over (Q)}

D)   (109 )

{circumflex over (R)} mod m _(e):=[(X mod m _(e))−({circumflex over (Q)}mod m _(e))×(D mod m _(e))]mod m _(e)   (110)

-   -   {circle around (2)} Note that the input X is made available to        the algorithm only as a residue touple, not as a fully        reconstructed decimal or binary integer. In addition, one extra        bit_(—) conveyed by (X mod m_(e)) is also required by the        algorithm. Given these inputs, the algorithm generates        {circumflex over (Q)} as well as ({circumflex over (Q)} mod        m_(e)) (and therefore {circumflex over (R)} and ({circumflex        over (R)} mod m_(e)) as per Eqns (109) and (110)), thereby        demonstrating that the outputs are delivered consistently in the        same format as the inputs.    -   {circle around (3)} The integer estimate {circumflex over (Q)}        corresponding to the residue-touple {circumflex over (Q)} can        take only one of the two values

If the variable/flag “Q_is_exact” is set to the value “1”, then{circumflex over (Q)}=Q, i.e., the estimate equals the exact integerquotient. In practice (numerical experiments) this happens in anoverwhelmingly large number of cases.

otherwise, the flag Q_is_exact=0, indicating that the algorithm couldnot determine whether or not {circumflex over (Q)} is exact. (because ofthe drastically reduced precision used to store the pre-computedfractions) In this case {circumflex over (Q)} could be exact, i.e.,{circumflex over (Q)}=Q

or {circumflex over (Q)}=(Q−1), i.e., Q can under-estimate Q by a ulp.

Further disambiguation between these two values is possible bycalculating the estimated-remainder {circumflex over (R)} and checkingwhether ({circumflex over (R)}−D) is +ve or −ve

Let the exact integer remainder be denoted by R. It is clear that theestimated integer-remainder {circumflex over (R)} can have only twopossible values:

if {circumflex over (Q)} is exact, then {circumflex over (R)}=R, i.e.,{circumflex over (R)} is also exact; or   (111)

{circumflex over (Q)}=(Q−1)

{circumflex over (R)}=X 31 (Q−1)D=(X−Q·D)+D=R+D   (112)

In other words, in the relatively infrequent case

, performing a sign-detection on ({circumflex over (R)}−D) is guaranteedto identify the correct Q and R in all cases. (if ({circumflex over(R)}−D) is +ve, then it is clear that {circumflex over (Q)}underestimated Q by a ulp; otherwise {circumflex over (Q)}=Q)

§4.5.4 Estimation of the Delay of and the Memory Required for the QFSAlgorithm

§4.5.4.A Delay Model and Latency Estimation

We assume dedicated h/w implementation of all channels (including theextra channels). Within each channel the look-up tables are alsoimplemented in h/w (note that the tables need not be “writable”). Alltables are independently readable in parallel with a latency of O(lgn).Likewise, since each component modulus is small as well as the number ofchannels (K) is also small, we assume that a dedicated adder-tree isavailable in each channel for the accumulations modulo thecomponent-modulus for that channel. The latency of the accumulations canbe also shown to be logarithmic in the word-length, i.e., O(lgn).Likewise, we assume that a fast, multistage or barrel shifter isavailable per channel so that delay of “variable: shifts is also O(lgn).

FIG. 7 illustrates a timing diagram showing the sequence of successivetime-blocks in which the various steps of the QFS algorithm getexecuted. At the top of each block, we have also shown its latency as afunction of (the overall RNS word-length) n, under the assumptionsstated above.

Since the maximum latency of any of the blocks is O(lgn), theoverall/total latency of the h/w implementation is estimated to beO(lgn).

§4.5.4.B Memory Requirements

In addition to the reconstruction table, we also need the QuotientTables. The total number of number of entries in both parts of theQuotient table is O(K²/2)+O(K−1)=O(K²). In this case, each table entryhas K+1 components, wherein, each component is no bigger than O(lglgK)bits. Consequently the total storage (in bits) that is required is≈O(K³lglgK) bits ≈O(n³lglgn) bits.

§4.6 Modular Exponentiation Entirely within the Residue Domain

Modular exponentiation refers to evaluating (X^(Y) mod D). In manyinstances, in addition to D, the exponent Y is also known ahead of time(ex: in the RSA method, Y is the public or private-key). Our method doesnot need Y to be a constant, but we assume that it is a primary/externalinput to the algorithm and hence available in any desired format (inparticular, we require the exponent Y as a binary integer, i.e., astring of w-bits).

$\begin{matrix}\begin{matrix}{{{Let}\mspace{14mu} Y} = {{y_{w - 1}2^{w - 1}} + {y_{w - 2}2^{w - 2}} + \ldots + {y_{2}2^{2}} + {y_{1}2^{1}} + y_{0}}} \\{= {\left( \mspace{14mu} {{\ldots \mspace{14mu} \left( {{\left( {{y_{w - 1}*2} + y_{w - 2}} \right)*2} + y_{w - 3}} \right)*2} + \ldots + y_{0}} \right)(114)}}\end{matrix} & (113)\end{matrix}$

To the best of our knowledge, one of the fastest methods to performmodular exponentiation expresses the exponent Y as polynomial of radix2, parenthesized as shown Eqn (114) above (known as the “Horner'smethod” of evaluating a polynomial). Since the coefficient of theleading-term in (113) must be non-zero, (i.e. y_(w−1)=1), the modularexponentiation starts with the initial value Ans:=X² mod D. If y_(w−2)≠0then the result is multiplied (modulo-D) by X and is then squared. Thisoperation is repeatedly performed in a loop as shown below:

# Initialization:   Ans = X² mod D; mod_red_1 # Loop: for i from w − 2by −1 to 1 do  curbit := Y[i];  if curbit = 1 then    Ans := (Ans × X)mod D ; mod_red_2  fi;  Ans := (Ans)² mod D; mod_red_3 od; if (y0 = 1)then  Ans = (Ans × X) mod mod_red_4 D; fi;

The obvious speedup mechanism is to deploy the QFS algorithm to realizeeach modular-reduction, aka, remaindering operation. (the remainderingoperations needed in modular-exponentiation are tagged with the label“mod_red_n” inside a box at the end of the corresponding line in themaple-style pseudo-code above).

§4.6.1 Further Optimization: Avoiding Sign-Detection at the End of QFS

Result 3: Directly using the estimate {circumflex over ({circumflex over(Q)})} to evaluate {circumflex over (R)} as a residue-touple (as per Eqn(109) above), corresponds to an estimated integer-remainder {circumflexover (R)} that is in the same residue class (w.r.t. the Divisor D) asthe correct remainder R

Proof: Immediately follows from the definition of the residue class:

Definition 1: Integers p and q are in the same residue class w.r.t. D if(p mod D=q mod D)

Eqns (111) and (112) show that {circumflex over (R)} ∈ {R, R+D}

it is in the same residue-class as the exact integer remainder R.

Next, we show that as long as the range of the RNS system issufficiently large, it is possible to use incorrect values for theremainder at intermediate steps of modular exponentiation, (as long asthey are in the proper residue class); and still generate the correctfinal result.

Result 4: If the inputs X₁ and X₂ to the QFS algorithm are in the sameresidue class w.r.t. the (constant/known) divisor D then the remainderestimates {circumflex over (R)}₁ and {circumflex over (R)}₂ evaluatedusing the quotient estimates {circumflex over (Q)}₁ and {circumflex over(Q)}₂ returned by the QFS algorithm both satisfy the constraints

{circumflex over (R)} ₁ can assume only one of the two values:{circumflex over (R)}₁ =R or {circumflex over (R)} ₁ =R+D   (115)

{circumflex over (R)} ₂ can assume only one of the two values:{circumflex over (R)}₂ =R or {circumflex over (R)} ₂ =R+D   (116)

where R is the correct/exact integer remainder. (this holds even if the“Q_is_exact” flag is set to 0, indicating that the algorithm could notdetermine whether or not the quotient estimate equals the exactquotient).

Result 5: If the range of the RNS is sufficiently large, then there isno need for a sign-detection at the end of the QFS algorithm in order toidentify the correct remainder in intermediate steps during themodular-exponentiation operation.

Proof: Assume that at the end of some intermediate step i, {circumflexover (Q)}=(Q−1) thereby causing

Ans _(i) :={circumflex over (R)} _(i) =R _(i) +D instead of the correctvalue Ans _(i) :={circumflex over (R)} _(i) =R _(i);   (117)

Then, as seen in the pseudo-code for modular exponentiation (which isillustrated in Section §4.6.2) above, the next operation is either amodular-square or a modular-multiplication:

Ans _((i+1)):=(Ans _(i))² mod D or Ans _((i+1)):=(Ans_(i) ×X) mod D

  (118)

Ans _((i+1)):=(R _(i) +D)² mod D or Ans _((i+1)):=(R_(i) +D)×X mod D  (119)

instead of the correct values

Ans _((i+1)) :=R _(i) ² mod D or Ans _((i+1)):=(R _(i) ×X) mod D

However, note that

R_(j) ² is in the same residue class w.r.t. D as (R_(i)+D)² and   (120)

[(R_(i)+D)×X] is in the same residue class w.r.t. D as (R_(i)×X)   (121)

Therefore from claim 2 above, it follows that in either paths(modular-square or modular-product-by-X) the answers obtained at the endof the next step satisfy the exact same constraints, specified byEquations (115) and (116), independent of whether the answers(remainders) at the end of the previous step were exact or had an extraD in them; which shows that performing a sign-detection on the{circumflex over (Q)} returned by the QFS algorithm is not necessary.

Result 6: A single precision RNS range

3D, and correspondingly a double-precision range

9D² is sufficient to obviate the need for a sign-detection

Proof: Since the correct remainder satisfies the constraints 0

R<D, it is clear that the erroneous remainder value (R+D) satisfies

0<D

(R+D)<2D   (122)

As a result, the estimated remainder could be as high as about/almost2D. We therefore set the single-precision range-limit to be 3D so thatthe full double length values could be as large as (3D)²=9D².Accordingly, we select K-smallest-consecutive prime numbers such thattheir product exceeds 9D². With this big a range, either modular-squareor modular-multiplication using an inexact remainder does not causeoverflow, as per constraint (122) above

§4.6.2 The ME-FWRD Algorithm: Maple-Style Pseudo-Code

# First we specify a procedure (“proc” in maple) which is a smallwrapper around the QFS algorithm QFS _rem_estimate := proc( X,(X modm_(e)))   {circumflex over (Q)}, {circumflex over(Q)}_mod_me, Q_is_exact := Quotient_First_Scaling_Estimate( X, 

X mod m_(e )

)  R_is_exact := Q_is_exact;  # if {circumflex over (Q)} is exact thenso is {circumflex over (R)}  {circumflex over (R)}_mod_me := [X modm_(e) − {circumflex over (Q)}_mod_me ×(D mod m_(e))] modm_(e);  # bootstrapping...   {circumflex over (R)} = X 

  ( {circumflex over (Q)} 

   D);  Return( {circumflex over (R)}, {circumflex over(R)}_mod_me, R_is_exact);  end proc ; Algorithm ModExp_Fully_Within_Residue_Domain ( X,(X mod m_(e)), Y)  #Inputs: X as a residue-touple, the extra-info, and Y as a w-bitbinary-number  # We assume that the constraint X < M has been enforcedbefore converting the primary input X into  # a residue-touple  #Pre-computations: moduli where

 

 9D², D_mod_me, D≡ residue-touple for D,...  # and everything requiredby the QFS algorithm  # Initializations   Ans, Ans_mod_me, Ans_is_exact:= qfs_rem_estimate( X, 

X mod m_(e)

) ;   Ans:= Ans 

   Ans;  Ans_mod_me := (Ans_mod_me)² mod m_(e)   #bootstrapping...  Ans, Ans_mod_me, Ans_is_exact := qfs_rem_estimate( Ans, Ans_mod_me) ; for i from w − 2 by −1 to 1 do   #Loop    curbit := Y[i];    if curbit= 1 then       Ans:= Ans  

   X;  Ans_mod_me := (Ans_mod_me × X_mod_me) mod m_(e);       Ans,Ans_mod_me, Ans_is_exact := qfs_rem_estimate( Ans, Ans_mod_me) ;    fi;    Ans:= Ans  

   Ans;   Ans_mod_me := (Ans_mod_me)² mod m_(e) ;     Ans, Ans_mod_me,Ans_is_exact := qfs_rem_estimate( Ans, Ans_mod_me) ;  od;  if (y₀ = 1)then    # Ans = (Ans × X) mod D     Ans:= Ans 

   X;  Ans_mod_me := (Ans_mod_me × X_mod_me) mod m_(e) ;     Ans,Ans_mod_me, Ans_is_exact := qfs_rem_estimate( Ans, Ans_mod_me) ;  fi;   # Outputs: remainder-touple, extra-info, exactness-flag  Return( Ans,Ans_mod_me, Ans_is_exact); □  End_Algorithm

Correctness of the algorithm follows from the analytical resultspresented so far. Moreover the algorithm was implemented in Maple andextensively tested on a large number of cases.

§4.6.3 Delay Estimation of the Proposed Modular-Exponentiation Algorithm

Pre-computation costs are not considered (they represent one-time fixedcosts).

(i) The main/dominant delay is determined by the delay of the loop.

Assuming that the exponent Y is about as big as D, the number of timesthe exponentiation loop is executed=lgY≈O(n) times.

(ii) Determination of the Quotient estimate is the most time-consumingoperation in each iteration of the loop and it requires O(lgn) delay (asexplained in Section §4.5.4-A).

As a result, each iteration of the loop requires (O(lgn) delay.

(iii) Therefore, the total delay is O(nlgn).

The memory requirements are exactly the same as those of the QFSalgorithm: ≈O(n³lglgn) bits as shown above (in Section §4.5.4-B).

§4.6.4 Some Remarks about the ME-FWRD Algorithm

{circle around (1)} In a remaindering operation, it is possible tounder-estimate the quotient, but it is not acceptable to over-estimatethe quotient even by a ulp for the following reason:

if {circumflex over (Q)} is an over-estimate, then {circumflex over(R)}=X−{circumflex over (Q)}×D

0 and therefore gets evaluated as   (123)

{circumflex over (R)}≡

−|{circumflex over (R)}| which is not in the same residue class w.r.t. Das the correct remainder R   (124)

{circle around (2)} The algorithm always works in full (double)precision mode. In the RNS, increased word length simply requires somemore channels. In a dedicated h/w implementation, all the channels canexecute concurrently, fully leveraging the parallelism inherent in thesystem. Hence, the incremental delay (as a result of doubling theword-length) is minimal: Since doubling the word-length adds one-levelto each adder/accumulation-tree (within each RNS-channel),

the incremental delay is ≈O(1).

§4.7 Convergence Division via Reciprocation to Handle Arbitrary, DynamicDivisors

let X be a 2n bit dividend

X=X _(i)·2^(n) +X _(l)   (125)

where X_(u) is the upper-half (more-significant n bits) and X_(l) is thelower-half. Let D be an n bit-long divisor. Then, the quotient Q is

$\begin{matrix}{\begin{matrix}{Q = \left\lfloor \frac{X}{D} \right\rfloor} \\{= \left\lfloor \frac{{X_{u} \cdot 2^{n}} + X_{l}}{D} \right\rfloor} \\{{= {\left\lfloor \frac{X_{u} \cdot 2^{n}}{D} \right\rfloor + \left\lfloor \frac{X_{l}}{D} \right\rfloor + \delta}}\mspace{11mu}}\end{matrix}{{{wherein}\mspace{14mu} \delta} = \left\{ {0,1} \right\}}} & (126)\end{matrix}$

Since X _(l) and D are both n bit long numbers X _(l)<2D

  (127)

$\begin{matrix}{\left\lfloor \frac{X_{l}}{D} \right\rfloor = \left\{ \begin{matrix}0 & {{{if}\mspace{14mu} X_{l}} < D} \\1 & {otherwise}\end{matrix} \right.} & (128)\end{matrix}$

The remaining term is

$\begin{matrix}{{{\left\lfloor \frac{X_{u} \cdot 2^{n}}{D} \right\rfloor \mspace{14mu} {wherein}\mspace{14mu} \frac{X_{u} \cdot 2^{n}}{D}} = \frac{X_{u}}{D_{f}}}{{{where}\mspace{14mu} D_{f}} = \left. \frac{D}{2^{n}}\Rightarrow{\frac{1}{2}D_{f} < 1} \right.}} & (129)\end{matrix}$

In the inequality above, the lower bound ½ follows from the fact thatthe leading bit of an n-bit long number D is 1 (if not, the word-lengthof D would be smaller than n). Also note that the maximum value of then-bit integer D can be (2^(n)−1), which yields the upper bound 1 onD_(f).

$\begin{matrix}{{{Let}\mspace{14mu} D_{f\_ inv}} = {\left. \frac{1}{D_{f}}\Rightarrow{1 < D_{f\_ inv}{2\mspace{14mu} {and}\mspace{14mu} {let}\mspace{14mu} D_{f\_ inv}}} \right. = {{1 + {F\mspace{14mu} {where}\mspace{14mu} 0}} < F1}}} & (130) \\{{Then},{\frac{X_{u}}{D_{f}} = {{x_{u} \times D_{f\_ inv}} = {{X_{u}\left( {1 + F} \right)} = {\left. {X_{u} + {X_{u}F}}\Rightarrow\left\lfloor \frac{X_{u}}{D_{f}} \right\rfloor \right. = {X_{u} + \left\lfloor {X_{u}F} \right\rfloor}}}}}} & (131)\end{matrix}$

From the last equality it is clear that in order to correctly evaluate└X_(u)D_(f) _(—) _(inv)┘, the value of F (which is the fractional partof D_(f) _(—) _(inv)) must be evaluated upto at least n bits ofprecision.

To evaluate D_(f) _(—) _(inv), let

$\begin{matrix}{Y = \left. \left( {1 - D_{f}} \right)\Rightarrow{0 < Y\frac{1}{2}} \right.} & (132)\end{matrix}$

Note that the integer Y _(int)=(2^(n) Y)=(2^(n) −D)   (133)

(134)

Substituting D_(f) in terms of Y, yields

$\begin{matrix}\begin{matrix}{D_{f\_ inv} = \frac{1}{D_{f}}} \\{= \frac{1}{\left( {1 - Y} \right)}} \\{= \frac{\left( {1 + Y} \right)}{\left( {1 - Y^{2}} \right)}} \\{= \frac{\left( {1 + Y} \right)\left( {1 + Y^{2}} \right)}{\left( {1 - Y^{4}} \right)}} \\{= \ldots} \\{= \frac{\left( {1 + Y} \right)\left( {1 + Y^{2}} \right)\mspace{14mu} \ldots \mspace{14mu} \left( {1 + Y^{(2^{t})}} \right)}{\left( {1 - Y^{(2^{2t})}} \right)}}\end{matrix} & (135)\end{matrix}$

In the last set of equalities, since the numerator and the denominatorof each successive expression are both multiplied by the same

factor of the form (1+Y ⁽² ^(i) ⁾) at step i, i=0, 1,   (136)

the original value of the reciprocal does not change. Also note thateach successive multiplication by a factor of the above form doubles thenumber of leading ones in the denominator. As a result the denominatorin the successive expressions in Eqn (135) approaches the value 1 frombelow (it becomes 0.11111 . . . ).

It is well known [6] that

when the number of leading-ones in the denominator exceeds theword-length (i.e., n bits),

the error ε in the numerator also satisfies the bound ″ε|<2^(−n)

and the iterations can be stopped. In other words, when t leads to thesatisfaction of the constraint

$\begin{matrix}\left. {\left( {1 - Y^{(2^{2t})}} \right)\left( {1 - 2^{(n)}} \right)}\Rightarrow{{({lgY})2^{({2t})}}n}\Rightarrow{t{\frac{1}{2}\left( {{lgn} - {lglgY}} \right)}} \right. & (137)\end{matrix}$

the iterations can be stopped and the approximation

$\begin{matrix}\begin{matrix}{D_{f\_ inv} = \frac{\left( {1 + Y} \right)\left( {1 + Y^{2}} \right)\mspace{14mu} \ldots \mspace{14mu} \left( {1 + Y^{(2^{t})}} \right)}{\left( {1 - Y^{(2^{2t})}} \right)}} \\{\approx \frac{\left( {1 + Y} \right)\left( {1 + Y^{2}} \right)\mspace{14mu} \ldots \mspace{14mu} \left( {1 + Y^{(2^{t})}} \right)}{1}}\end{matrix} & (138)\end{matrix}$

can be used. Thus, number of iterations in a convergence division isO(lg).

In contrast any digit-serial division fundamentally requires O(n) steps.

It turns out that the above convergence method is equivalent tonewton-style convergence iterations (for details, please see anytextbook on Computer Arithmetic [6,7]). Newton's method is quadraticwhich means that the error

ε_(n+1) after the (n+1)th iteration ≈O((ε_(n))²)   (139)

which in-turn implies that the number of accurate bits doubles aftereach iteration (which is why convergence-division is the method ofchoice in high speed implementations).

From (138) it is clear that

D _(f) _(—) _(inv)≈(1+Y)(1+Y ²) . . . . (1+Y ⁽² ^(t) ⁾)

Accordingly the products need to be accumulated, so as to yield aprecision of 2^(n)-bits at the end. Since a product of two n bit numbers(which includes a square) can be upto 2n bits long, the lower half ofthe double length product must be discarded retaining only the n mostsignificant bits at every step. Each such retention of n mostsignificant bits is tantamount to division by a constant, viz., 2^(n).Thus the QFS algorithm needs to be invoked at every step in theconvergence division method. In general the SODIS algorithm is alsoneeded at each step. Those familiar with the art will appreciate thatusing all the preceding algorithms unveiled herein, an ultrafastconvergence division via reciprocation can be realized without everhaving to leave the residue domain at any intermediate step.

§5 Application Scenarios

The difficulty of implementing some basic canonical operations such asbase-extension, sign-detection, scaling, and division has prevented thewidespread adoption of the RNS. The algorithms and apparata unveiledherein streamline and expedite all these fundamental operations, therebyremoving all roadblocks and unleashing the full potential of the RNS.Since a number representation is a fundamental attribute that underliesall computing systems, expediting all arithmetic operations using theRNS can potentially affect all scenarios that include computing systems.The scenarios that are most directly impacted are listed below.

-   -   {circle around (1)} At long wordlengths the proposed system        yields substantially faster implementations. Therefore,        cryptographic processors are likely to adopt the RNS together        with the algorithms unveiled in this document. All other long        wordlength applications (such as running Sieves for factoring        large numbers or listing the prime numbers within a given        interval, etc) will substantially benefit from hardware as well        as software implementations of the proposed number system and        the accompanying algorithms    -   {circle around (2)} Digital Signal Processing is dominated by        multiply and add operations. The proposed representation is        therefore likely to be adopted in DSP processors.        -   Scaling is particularly easy if the scaling factor is            divisible by one or more of the component moduli.        -   My method of selecting moduli uses all prime numbers up to a            certain threshold value. So there is ample scope to select a            scale factor that is divisible by one or more of the moduli.            This is an added advantage of the method of moduli selection            that I have adopted.    -   {circle around (3)} Ultra-fast counters, constant-time or        wordlength-independent up/down counters are significantly faster        as well as simpler to realize when the RNS system is used.        -   Consequently, the theory as well as designs of such counters            should switch over to using the RNS and the accompanying            algorithms    -   {circle around (4)} Memory and cache organization/access is        another potentially significant application area. Conceptually,        memory needs be abstracted as though it were a “linear” storage        because the indexing calculations in conventional        (binary/decimal) number representations are easier when the        memory is logically organized linearly. Adopting the RNS allows        a different conceptual organization of storage resources (ex:        under RNS, storage can be conceptualized as a collection of        buckets)    -   {circle around (5)} Realization of hash functions is faster and        easier when there is native/hardware support for modulo        operations that are required in the RNS. That in turn opens up        other possibilities to further streamline and expedite other        algorithms and/or apparata (such as Bloom filters, for        instance).    -   {circle around (6)} Coding Theory and practice have pretty much        revolved around conventional number representations. RN systems        offer a rich mix of choices to further improve coding theory and        practice.    -   {circle around (7)} Hardware implementations based on the RNS        are inherently more parallel, since all channels can do their        processing independently, thereby increasing the “locality” of        processing and drastically reducing long interconnects. This in        turn makes the circuits more compact and faster while        simultaneously requiring substantially lower amount of power        (than equivalent circuits based on conventional number        representations). Moreover, the independence of channels makes        the hardware lot easier to test (VLSI testing is a critically        important issue). Finally, hardware realizations based on the        RNS are more reliable.    -   {circle around (8)} Even when the RNS is implemented in        software, it can still improve the utilization of the multi-core        processors today. Channel(s) in the RNS could be dynamically        mapped onto threads which could in turn be dynamically allocated        and executed on any of the multiple cores.    -   {circle around (9)} Random number generation based on RNS system        is another area that appears to have a great potential.    -   {circle around (10)} Theoretical Issues: for instance the        Remainder Theorem constitutes an “orthogonal” decomposition (in        a sense analogous to the Discrete Fourier Transform, i.e., the        DFT). This is why multiplication (which is a convolution)        becomes so simple, all the cross-terms go away . . . .        -   What kind of redundancy is there in RNS representation?        -   The novelty of the methods unveiled herein lies in their use            of both the integer as well as fractional parts rather than            sticking to only one. Can such methods be further extended            and applied to other well know hard problems ? are these            methods related to “Interior Point Methods”?

5.1 Distinctions and Novel Aspects of this Invention

-   -   {circle around (1)} All of the algorithms make maximal use of        those intermediate variables whose values can be expressed as        the most significant digits of a computation (the reader can        verify that this is the case in the “RPPR”, SODIS, as well as        the QFS algorithm)        -   This enables the use of approximation methods.    -   {circle around (2)} Accuracy of approximation is in turn related        to the precision required. The algorithms therefore use the        minimal amount of precision necessary for the computation.        -   I leverage the rational domain interpretation (i.e., joint            integer as well as fractional domain interpretations) of the            Chinese Remainder Theorem in order to drastically reduce the            precision of the fractional values that need to be            pre-computed and stored in look-up tables. It turns out that            a drastic reduction of precision from n-bits to ┌lgn┐ bits            still allows a highly accurate estimation of some canonical            intermediate variables, wherein the estimate can be off only            by a ulp. In other words, the exact value of the computation            can be narrowed down to a pair of successive integers. This            strategy is adopted in all the methods (viz, RPPR, SODIS, as            well as the QFS algorithms)        -   The novel “disambiguation” step then selects the right            answer(by disambiguating between the two choices) in all            cases.        -   In other words, in a fundamental sense, I have identified            the optimal mix of which and how much information from the            fractional domain needs be combined with which specific            portion of the information available from the integer-domain            interpretation of the CRT in order to achieve ultrafast            execution; and developed new methods that fully exploit that            optimal mix.    -   {circle around (3)} My Moduli selection method simultaneously        achieves three optimizations:        -   O1: It maximizes the amount of pre-computation to the            fullest extent; making it possible to deploy exhaustive            look-up tables that cover all possible input cases.        -   O2: Simultaneously, it also minimizes the amount of memory            required (otherwise an exhaustive look-up would not be            feasible at long bit-lengths).        -   O3: It minimizes the size that each individual            component-modulus nt_(i) can assume. The net effect is that            the RNS is realized via a moderately large number of            channels, each of which has a very small modulus.        -   In other words, the moduli-selection brings out the            parallelism inherent in the RNS to the fullest extent.    -   {circle around (4)} The said moduli selection therefore leads to        two critical benefits        -   B+1: Exhaustive pre-computation implies that there is very            little left to be done at run-time, which leads to ultrafast            execution (a good example is the Quotient First Scaling            (QFS) algorithm).        -   B+2: Exploiting the parallelism to the fullest extent while            also using the smallest amount of memory further speeds up            execution and cuts down on area and power consumption of            hardware realizations.    -   {circle around (5)} All of the prior works published hitherto        had a narrower focus. For example, [9], [12] (and their        derivatives that have appeared since) are mainly concerned with        “base-extension”. On the other hand, Vi's first paper [16] was        aimed at “fault detection in flight control systems”; while the        follow-on journal paper [17] was focused on “sign-detection”.        Likewise, Lu's work [25,26] was mainly focused on a more        efficient sign-detection and its application to division in the        RNS.        -   In contrast, we have developed a unified framework that            expedites all the difficult RNS operations simultaneously.    -   {circle around (6)} The algorithms can be implemented in        software (wherein the computation within each channel is done        within a separate thread and the multiple threads get        dynamically mapped onto different cores in a multi-core        processor) or in hardware. In either case they offer a wide        spectrum of choices that trade-off polynomially increasing        amounts of pre-computations and look-up-table memory to achieve        higher speed. In other words the algorithms are flexible and        allow the designer a wide array of choices for deployment.

FIG. 9 is a flow chart representation of a process 900 of performingreconstruction using a residue number system. At box 902, a set ofmoduli is selected. At box 904, a reconstruction coefficient isestimated based on the selected set of moduli. At box 906, areconstruction operation using the reconstruction coefficient isperformed. As previously discussed, in some designs, additionaloperations may also be performed using the reconstruction operation.

In some designs, the operation of selecting the set of moduli is done soas to enable an exhaustive pre-computation and look-up strategy thatcovers all possible inputs. In some designs, the determination ofreconstruction coefficient may be performed in hardware such that thedetermination is upper limited by delay of O(log n) where n is aninteger number representing wordlength.

FIG. 10 is a block diagram representation of a portion of an apparatus1000 for performing reconstruction using a residue number system. Themodule 1002 is provided for selecting a set of moduli. The module 1004is provided for estimating a reconstruction coefficient based on theselected set of moduli. The module 1004 is provided for performing areconstruction operation using the reconstruction coefficient.

FIG. 11 is a flow chart representation of a process 1100 of performingdivision using a residue number system. At box 1102, a set of moduli isselected. At box 1104, a reconstruction coefficient is determined. Atbox 1106, a quotient is determined using an exhaustive pre-computationand a look-up strategy that covers all possible inputs.

FIG. 12 is a block diagram representation of a portion of an apparatus1200 for performing division using a residue number system. The module1202 is provide for selecting a set of moduli. The module 1204 isprovided for determining a reconstruction coefficient. The module 1206is provided for determining a quotient using an exhaustivepre-computation and a look-up strategy that covers all possible inputs.

FIG. 13 is a flow chart representation of a process 1300 of computing amodular exponentiation using a residue number system. At box 1302,iterations are performed without converting to a regular integerrepresentation, by performing modulator multiplications and modularsquaring. At box 1304, the modular exponentiation is computed as aresult of the iterations.

FIG. 14 is a block diagram representation of a portion of an apparatus1400 for computing a modular exponentiation using a residue numbersystem. The module 1402 is provided for iterating, without converting toa regular integer representation, by performing modular multiplicationsand modular squaring. The module 1404 is provided for computing themodular exponentiation as a result of the iterations.

It is noted that in one or more exemplary embodiments described herein,the functions and modules described may be implemented in hardware,software, firmware, or any combination thereof. If implemented insoftware, the functions may be stored on or transmitted over as one ormore instructions or code on a computer-readable medium.Computer-readable media includes both computer storage media andcommunication media including any medium that facilitates transfer of acomputer program from one place to another. A storage media may be anyavailable media that can be accessed by a computer. By way of example,and not limitation, such computer-readable media can comprise RAM, ROM,EEPROM, CD-ROM or other optical disk storage, magnetic disk storage orother magnetic storage devices, or any other medium that can be used tocarry or store desired program code in the form of instructions or datastructures and that can be accessed by a computer. Also, any connectionis properly termed a computer-readable medium. For example, if thesoftware is transmitted from a website, server, or other remote sourceusing a coaxial cable, fiber optic cable, twisted pair, digitalsubscriber line (DSL), or wireless technologies such as infrared, radio,and microwave, then the coaxial cable, fiber optic cable, twisted pair,DSL, or wireless technologies such as infrared, radio, and microwave areincluded in the definition of medium. Disk and disc, as used herein,includes compact disc (CD), laser disc, optical disc, digital versatiledisc (DVD), floppy disk and blue-ray disc where disks usually reproducedata magnetically, while discs reproduce data optically with lasers.Combinations of the above should also be included within the scope ofcomputer-readable media.

As utilized in the subject disclosure, the terms

{hacek over (r)}system,

ś

{hacek over (r)}module,

ś

{hacek over (r)}component,

ś

{hacek over (r)}interface,

ś and the like are likewise intended to refer to a computer-relatedentity, either hardware, a combination of hardware and software,software, or software in execution. Components can include circuitry,e.g., processing unit(s) or processor(s), that enables at least part ofthe functionality of the components or other component(s) functionallyconnected (e.g., communicatively coupled) thereto. As an example, acomponent may be, but is not limited to being, a process running on aprocessor, a processor, a machine-readable storage medium, an object, anexecutable, a thread of execution, a program, and/or a computer. By wayof illustration, both an application running on a computer and thecomputer can be a component. One or more components may reside within aprocess and/or thread of execution and a component may be localized onone computer and/or distributed between two or more computers.

The aforementioned systems have been described with respect tointeraction between several components and modules. It can beappreciated that such systems, modules and components can include thosecomponents or specified sub-components, some of the specified componentsor sub-components, and/or additional components, and according tovarious permutations and combinations of the foregoing. Sub-componentsalso can be implemented as components communicatively coupled to othercomponents rather than included within parent component(s).Additionally, it should be noted that one or more components may becombined into a single component providing aggregate functionality ordivided into several separate sub-components and may be provided tocommunicatively couple to such sub-components in order to provideintegrated functionality. Any components described herein may alsointeract with one or more other components not specifically describedherein but generally known by those of skill in the art.

Moreover, aspects of the claimed subject matter may be implemented as amethod, apparatus, or article of manufacture using standard programmingand/or engineering techniques to produce software, firmware, hardware,or any combination thereof to control a computer or computing componentsto implement various aspects of the claimed subject matter. The term“article of manufacture” as used herein is intended to encompass acomputer program accessible from any computer-readable device, carrier,or media. For example, computer readable media can include but are notlimited to magnetic storage devices (e.g., hard disk, floppy disk,magnetic strips, optical disks (e.g., compact disk (CD), digitalversatile disk (DVD), smart cards, and flash memory devices (e.g., card,stick, key drive. Additionally it should be appreciated that a carrierwave can be employed to carry computer-readable electronic data such asthose used in transmitting and receiving voice mail or in accessing anetwork such as a cellular network. Of course, those skilled in the artwill recognize many modifications may be made to this configurationwithout departing from the scope or spirit of what is described herein.

What has been described above includes examples of one or moreembodiments. It is, of course, not possible to describe everyconceivable combination of components or methodologies for purposes ofdescribing the aforementioned embodiments, but one of ordinary skill inthe art may recognize that many further combinations and permutations ofvarious embodiments are possible. Accordingly, the described embodimentsare intended to embrace all such alterations, modifications andvariations that fall within the spirit and scope of the appended claims.Furthermore, to the extent that the term

{hacek over (r)}includes

ś is used in either the detailed description or the claims, such term isintended to be inclusive in a manner similar to the term

{hacek over (r)}comprising

ś as

{hacek over (r)}comprising

ś is interpreted when employed as a transitional word in a claim.

1. A method of performing reconstruction using a residue number system,comprising; selecting a set of moduli; estimating a reconstructioncoefficient based on the selected set of moduli; and performing areconstruction operation using the reconstruction coefficient.
 2. Themethod of claim 1, wherein the selecting the set of moduli is done so asto enable an exhaustive pre-computation and look-up strategy that coversall possible inputs.
 3. The method of claim 1, wherein thereconstruction coefficient is determined in a delay limit of O(log n).4. The method of claim 1 wherein the estimating comprises: computing aplurality of reconstruction reminders; and quantizing the plurality ofreconstruction reminders.
 5. The method of claim 4 wherein thequantization comprises: expressing the reconstruction remainders asproper fractions; pre-computing the proper fractions in a pre-determinedradix b; truncating the proper fractions to a precision of no more than(┌log_(b) log_(b)

┐) radix-b fractional digits; scaling the truncated proper fractions bya scale factor so that multiplication by the scale factor simply amountsto a left-shifting of base-b digits and yields an integer value; andstoring the resulting integer values in look-up tables, wherein each RNSchannel 1, with component-modulus m_(i) requires one look-up table with(m_(i)−1) entries.
 6. The method of claim 5, wherein, channel look-uptables are read-only and are accessed completely independently of oneanother
 7. The method of claim 1 wherein all the operands are integersand all the arithmetic operations are carried out with an ultra-lowprecision of no more than (┌log_(b) K┐+┌log_(b) log_(b)M┐) radix-bdigits.
 8. The method of claim 1 wherein the estimate consists of a pairof consecutive integers, one of which is the correct value of thereconstruction coefficient.
 9. The method of claim 8, wherein, adisambiguation step is required to select the correct answer from amongthe two choices, by using an independent extra bit of information whichis maintained in the form of one extra residue, i.e., remainder, withrespect to an extra “disambiguator-modulus” m_(e) that satisfies thecondition: gcd(

, m_(e))<m_(e)
 10. The method of claim 9, wherein, a systematic“disambiguation-bootstrapping” process is required (and is thereforeadopted) to ensure that this extra remainder is always available for anyvalue that the method encounters.
 11. A method of performing divisionusing a residue number system, comprising: selecting a set of moduli;determining a reconstruction coefficient; and determining a quotientusing an exhaustive pre-computation and a look-up strategy that coversall possible inputs.
 12. The method of claim 11, wherein, thedisambiguation bootstrapping information regarding the determinedquotient Q is also computed.
 13. A method of computing a modularexponentiation in a residue number system, comprising: iterating,without converting to a regular integer representation, by performingmodular multiplications and modular squaring; computing the modularexponentiation as a result of the iterations.
 14. The method of claim13, wherein there is no conversion between distinct moduli sets within aresidue domain at any intermediate step throughout the computingprocess.
 15. An apparatus for performing reconstruction using a residuenumber system, comprising: means for selecting a set of moduli; meansfor estimating a reconstruction coefficient based on the selected set ofmoduli; and means for performing a reconstruction operation using thereconstruction coefficient.
 16. A computer program product comprising anon-volatile, computer-readable medium, storing computer-executableinstructions for performing reconstruction using a residue numbersystem, the instructions comprising code for: selecting a set of moduli;estimating a reconstruction coefficient based on the selected set ofmoduli; and performing a reconstruction operation using thereconstruction coefficient.